Abstract

This paper presents new results for the (partial) maximum a posteriori (MAP) problem in Bayesian networks, which 
is the problem of querying the most probable state configuration of some of the network variables given evidence. 
First, it is demonstrated that the problem remains hard even in networks with very simple topology, such as binary 
polytrees and simple trees (including the Naive Bayes structure). Such proofs extend previous complexity results for 
the problem. Inapproximability results are also derived in the case of trees if the number of states per variable is not 
bounded. Although the problem is shown to be hard and inapproximable even in very simple scenarios, a new exact 
algorithm is described that is empirically fast in networks of bounded treewidth and bounded number of states per 
variable. The same algorithm is used as the basis of a Fully Polynomial Time Approximation Scheme for MAP under
such assumptions. Approximation schemes were generally thought to be impossible for this problem, but we show 
otherwise for classes of networks that are important in practice. The algorithms are extensively tested using some 
well-known networks as well as randomly generated cases to show their effectiveness.

1. Introduction 

A Bayesian network (BN) is a probabilistic graphical model that relies on a structured dependency among random 
variables to represent a joint probability distribution in a compact and efficient manner. It is composed of a directed
acyclic graph (DAG) where nodes are associated to random variables and conditional probability distributions are
defined for variables given their parents in the graph. One of the hardest inference problems in BNs is the maximum a 
posteriori (or MAP) problem, where one looks for states of some variables that maximize their joint probability, given 
some other variables as evidence (there may exist variables that are neither queried nor part of the evidence). This 
problem is known to be NP^PP-complete in the general case and NP-complete for polytrees [1, 2]. Thus, algorithms
usually take a large amount of time to solve MAP even in small networks. Approximating MAP in polytrees is also
NP-hard. However, such hardness results are derived for networks with a large number of states per variable, which 
is not the most common situation in many practical problems. In this paper we consider the case where the number 
of states per variable is bounded. We prove that the problem remains hard even in binary polytrees and simple 
trees, using reductions from both the satisfiability and the partition problems, but we also show that there is a Fully 
Polynomial Time Approximation Scheme (FPTAS) whenever the treewidth and number of states are bounded, so we 
may expect fast algorithms for MAP with a small approximation error under such assumptions. A new exact algorithm 
is presented, which also delivers approximations with theoretically bounded errors. Empirical results show that this 
algorithm surpasses a state-of-the-art method for the same problem. Fast algorithms for MAP imply fast algorithms
for other related problems, for example inferences in decision networks and influence diagrams, besides the great 
interest in the MAP problem itself. Hence, this paper makes important steps in these directions. 

2. Background 

In this section we formally define the networks, the problem, and the algorithmic techniques that are used to prove 
the new complexity results as well as to devise the new algorithm for MAP. We assume that the reader is familiar with 
basic notions of complexity theory and approximation algorithms (for more details, see for example [3, 4, 5]) and
basic concepts of Bayesian networks [6, 7, 8].

Definition 1 A Bayesian network (BN) $\mathcal{N}$ is defined by a triple $(\mathcal{G}, \mathcal{X}, \mathcal{P})$, where
$\mathcal{G} = (V_{\mathcal{G}}, E_{\mathcal{G}})$ is a directed acyclic graph with nodes $V_{\mathcal{G}}$ associated (in a one-to-one mapping)
to random variables $\mathcal{X} = \{X_1, \ldots, X_n\}$ over discrete domains $\{\Omega_{X_1}, \ldots, \Omega_{X_n}\}$ and $\mathcal{P}$ is a
collection of probability values $p(x_i | \pi_{X_i}) \in \mathbb{Q}^1$, with $\sum_{x_i \in \Omega_{X_i}} p(x_i | \pi_{X_i}) = 1$, where
$x_i \in \Omega_{X_i}$ is a category or state of $X_i$ and $\pi_{X_i} \in \times_{X \in \Pi_{X_i}} \Omega_X$ is a complete instantiation for the
parents $\Pi_{X_i}$ of $X_i$ in $\mathcal{G}$. Furthermore, every variable is conditionally independent of its non-descendants given its parents.

Given its independence assumptions among variables, the joint probability distribution represented by a BN
$(\mathcal{G}, \mathcal{X}, \mathcal{P})$ is obtained by $p(\mathbf{x}) = \prod_i p(x_i | \pi_{X_i})$, where $\mathbf{x} \in \Omega_{\mathcal{X}}$ and all states
$x_i, \pi_{X_i}$ (for every $i$) agree with $\mathbf{x}$.
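
As a small illustration of this factorization, the following sketch evaluates the joint probability of a hypothetical three-node network A -> C <- B. The variable names and probability values are assumptions made up for the example; they do not come from the paper.

```python
from itertools import product

# Hypothetical network A -> C <- B with binary variables (illustrative values).
parents = {"A": (), "B": (), "C": ("A", "B")}
states = {"A": (0, 1), "B": (0, 1), "C": (0, 1)}
# cpt[X][parent states][x] = p(x | parent states)
cpt = {
    "A": {(): {0: 0.6, 1: 0.4}},
    "B": {(): {0: 0.7, 1: 0.3}},
    "C": {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.4, 1: 0.6},
        (1, 1): {0: 0.2, 1: 0.8},
    },
}


def joint(x):
    """Return p(x) as the product of the local conditional probabilities."""
    prob = 1.0
    for var, pa in parents.items():
        pa_states = tuple(x[p] for p in pa)
        prob *= cpt[var][pa_states][x[var]]
    return prob


# Summing the joint over all complete instantiations must give one.
names = list(states)
total = sum(
    joint(dict(zip(names, vals)))
    for vals in product(*(states[n] for n in names))
)
```

The number of complete instantiations grows exponentially with the number of variables, which is why the enumeration above is only practical for toy networks.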

For ease of exposition, we denote the singletons $\{X_i\}$ and $\{x_i\}$ respectively as $X_i$ and $x_i$. Nodes of the graph and their
associated random variables are used interchangeably. Uppercase letters are used for random variables and lowercase
letters for their corresponding states. Bold letters are employed for vectors/sets. We denote by $z(\mathbf{X})$ the product of
the cardinality of the variables $\mathbf{X} \subseteq \mathcal{X}$, that is, $z(\mathbf{X}) = \prod_{X_i \in \mathbf{X}} z(X_i)$, where $z(X_i) = |\Omega_{X_i}|$ is the
number of states (cardinality) of $X_i$. We assume $z(\emptyset) = 1$. We use simply $z_i$ to denote $z(X_i)$. The input size of a BN is
given by the sum of the sizes to specify the local conditional probability distributions and the space to describe the graph, that is,
$\mathrm{size}(\mathcal{N}) \in \Theta(|E_{\mathcal{G}}|) + \sum_i (z_i - 1) \prod_{X_j \in \Pi_{X_i}} z_j\,^{2} \in \sum_i \Theta(z(X_i \cup \Pi_{X_i}))\,^{3}$, as $\Theta(|E_{\mathcal{G}}|)$ is
clearly dominated by the summation $\sum_i \Theta(z(X_i \cup \Pi_{X_i}))$. Note that $\mathrm{size}(\mathcal{N}) \in \Omega(n)$ and
$\mathrm{size}(\mathcal{N}) \in \Omega(z_{\max})$, where $z_{\max} = \max_{X_i \in \mathcal{X}} z_i$.
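
The quantities $z(\mathbf{X})$ and the number of free parameters that dominates the input size can be computed directly from the graph and the cardinalities. The sketch below is a minimal illustration; the function names `z` and `parameter_count` and the example network are assumptions introduced here.

```python
from math import prod


def z(variables, card):
    """Product of the cardinalities of `variables`; z of the empty set is 1."""
    return prod(card[v] for v in variables)


def parameter_count(parents, card):
    """Sum over i of (z_i - 1) * z(parents of X_i): free parameters of the BN."""
    return sum((card[v] - 1) * z(parents[v], card) for v in parents)


# Hypothetical network A -> C <- B where C has three states.
example_parents = {"A": (), "B": (), "C": ("A", "B")}
example_card = {"A": 2, "B": 2, "C": 3}
example_size = parameter_count(example_parents, example_card)
```

Here A and B contribute one free parameter each and C contributes (3 - 1) * 4, so the count grows with the product of the parents' cardinalities, as the text notes.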

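The belief updating (BU) problem concerns the computation of $p(\mathbf{x}|\mathbf{e})$, for $\mathbf{x} \in \Omega_{\mathbf{X}}$ and $\mathbf{e} \in \Omega_{\mathbf{E}}$, with
$\mathbf{X} \cup \mathbf{E} \subseteq \mathcal{X}$ and $\mathbf{X} \cap \mathbf{E} = \emptyset$. It is known that the decision version of this problem (we denote it by
Decision-BU), which can be stated as "is it true that $p(\mathbf{x}|\mathbf{e}) > r$, for a given rational $r$", is PP-hard [9], using a
reduction from MAJ-SAT (majority satisfiability). As $p(\mathbf{x}|\mathbf{e}) = \frac{p(\mathbf{x}, \mathbf{e})}{p(\mathbf{e})}$, in the following we discuss how to
compute the query $p(\mathbf{x}')$, and the terms $p(\mathbf{x}, \mathbf{e})$ and $p(\mathbf{e})$ are obtained analogously by letting $\mathbf{x}' = \mathbf{x}, \mathbf{e}\,^{4}$ and
$\mathbf{x}' = \mathbf{e}$, respectively: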

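$$p(\mathbf{x}') = \sum_{\mathbf{y} \in \Omega_{\mathcal{X} \setminus \mathbf{X}'}} p(\mathbf{y}, \mathbf{x}') = \sum_{\mathbf{y} \in \Omega_{\mathcal{X} \setminus \mathbf{X}'}} \prod_i p(x_i | \pi_{X_i}), \quad (1)$$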

where $x_i \in \Omega_{X_i}$ and $\pi_{X_i} \in \times_{X \in \Pi_{X_i}} \Omega_X$ (for all $i$) agree with $\mathbf{y}, \mathbf{x}'$. Eq. (1) is a summation with
exponentially many terms, each one requiring less than $n$ multiplications. However, we can compute this huge summation in some
specific ordering to save time. Some definitions are required here. The moral graph of a network $\mathcal{N} = (\mathcal{G}, \mathcal{X}, \mathcal{P})$
(denoted $\mathcal{G}^m$) is obtained from $\mathcal{G}$ by connecting the nodes of $\mathcal{G}$ that have a common child (marrying parents), and
then dropping the direction of all the arcs. Well-known inference methods [10, 11] use a tree decomposition of $\mathcal{G}^m$ to
propagate results and speed up computations.
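
The moralization step described above can be sketched in a few lines. The representation of the DAG as a dictionary from each node to its tuple of parents is an assumption of this example.

```python
from itertools import combinations


def moral_graph(parents):
    """Return the undirected edge set of the moral graph of a DAG.

    `parents` maps each node to a tuple with its parents in the DAG.
    """
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset((p, child)))  # drop the direction of the arc
        for p1, p2 in combinations(pa, 2):
            edges.add(frozenset((p1, p2)))    # marry parents of a common child
    return edges


# Hypothetical v-structure A -> C <- B: moralization adds the edge A - B.
moral_edges = moral_graph({"A": (), "B": (), "C": ("A", "B")})
```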

Definition 2 Given a graph $\mathcal{G} = (V_{\mathcal{G}}, E_{\mathcal{G}})$, a tree decomposition $\mathcal{T}$ of $\mathcal{G}$ is a pair $(\mathbf{C}, T)$, where
$\mathbf{C} = \{C_1, \ldots, C_N\}$ is a family of non-empty subsets of $V_{\mathcal{G}}$, and $T$ is a tree where the nodes are associated (in a
one-to-one mapping) to the subsets $C_i$, satisfying the following properties: (i) $\bigcup_i C_i = V_{\mathcal{G}}$; (ii) For every edge
$E \in E_{\mathcal{G}}$, there is a subset $C_i$ that contains both extremes of $E$; (iii) If both $C_i$ and $C_j$ (with $i \neq j$) contain a
vertex $X$, then all nodes of the tree in the (unique) path between $C_i$ and $C_j$ contain $X$ as well.
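
The three properties of Definition 2 can be checked mechanically. The sketch below is one possible checker; it assumes that `tree_edges` really forms a tree over the bag identifiers (this is not verified), and all names are hypothetical. Property (iii) is tested through the equivalent condition that the bags containing a given vertex induce a connected subtree.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check properties (i)-(iii) for bags (dict id -> set) arranged in a tree."""
    # (i) the bags cover every vertex of the graph
    if set().union(*bags.values()) != set(vertices):
        return False
    # (ii) every edge of the graph lies inside some bag
    for u, v in edges:
        if not any(u in bag and v in bag for bag in bags.values()):
            return False
    # (iii) the bags holding each vertex form a connected part of the tree
    adj = {i: set() for i in bags}
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)
    for x in vertices:
        holders = {i for i, bag in bags.items() if x in bag}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            cur = stack.pop()
            for nxt in adj[cur] & holders:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if seen != holders:
            return False
    return True


# Path graph a - b - c with a valid and an invalid decomposition.
path_vertices = ["a", "b", "c"]
path_edges = [("a", "b"), ("b", "c")]
valid = is_tree_decomposition(
    path_vertices, path_edges, {1: {"a", "b"}, 2: {"b", "c"}}, [(1, 2)]
)
invalid = is_tree_decomposition(
    path_vertices, path_edges,
    {1: {"a", "b"}, 2: {"c"}, 3: {"b", "c"}}, [(1, 2), (2, 3)],
)
```

In the second call, vertex b appears in bags 1 and 3 but not in bag 2, which lies on the path between them, so property (iii) fails.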

Let $\mathcal{T} = (\mathbf{C}, T)$ be a tree decomposition of $\mathcal{G}^m$, with $\mathbf{C} = \{C_1, \ldots, C_{n'}\}$ and $n' \le n$ (this does not imply
any loss of generality [12]). Elect a node and assume all edges of $T$ point towards the opposite direction of it. Without loss
of generality, let $C_1$ be this node and $C_1, \ldots, C_{n'}$ be a topological order with respect to this graph, that is, the path
between $C_1$ and $C_j$ in $T$ does not contain any $C_{j'}$ with $j' > j$. Let $C_{j_p} = \Pi_{C_j}$ be the only parent of $C_j$ in the
tree and $\Lambda_{C_j}$ be the children of $C_j$. We say that $\mathcal{T}' = (\mathbf{C}', T')$ is a binary decomposition if $|\Lambda_{C_j}| \le 2$, for
every $C_j$ [13]. Note that it is easy to obtain a binary decomposition $\mathcal{T}'$ from $\mathcal{T}$: include additional nodes $C_{j,k}$ for each
$C_j$ that has more than two children (in number equal to $|\Lambda_{C_j}| - 1$) such that: (i) the variables inside each $C_{j,k}$ are the same



1 $\mathbb{Q}$ denotes the rational numbers, which are supposed to be given by two integers defining the numerator and the denominator of the
corresponding fractions.

2 If probability values repeat in a systematic way, one could represent the network in a more compact form. We do not consider such situation. 

3 We employ Knuth's asymptotic notation $\Omega(f)$, $O(f)$ and $\Theta(f)$. The reader shall not confuse the state spaces denoted by $\Omega_X$ (with a subscript)
and the asymptotic notation $\Omega(f)$. We use the notation $g \in \Omega(f)$ when $g$ is $\Omega(f)$, that is, $f$ is an asymptotic lower bound for $g$, and so on.

4 The comma shall be seen as a concatenation operator, that is, $p(\mathbf{x}') = p(\mathbf{x}, \mathbf{e}) = p(\mathbf{x} \wedge \mathbf{e})$ is the probability of observing altogether the
elements in $\mathbf{x}'$.



variables as those inside $C_j$; (ii) the nodes $C_{j,k}$ form a chain, where the root of the chain is $C_{j,1} = C_j$ and each
$C_{j,k}$ ($k \ge 1$) has $C_{j,k+1}$ and one of the original children of $C_j$ as its children. This transformation preserves the tree
decomposition properties and reduces the number of children of each $C_{j,k}$ to exactly two. Moreover, the maximum
number of variables inside a single $C_j$ is not changed (as we have just replicated $C_j$ into the elements $C_{j,k}$), and
the total number of nodes in the new tree is less than $2n$. Binary tree decompositions are useful later during some
derivations.

Let $\mathbf{X}_j^{last} = C_j \setminus C_{j_p}$ be the set of nodes of $\mathcal{G}$ in $C_j$ that do not appear in $C_{j_p}$ (they also do not appear in any
other node towards $C_1$ because of the tree decomposition properties). We have the following recursion, which is processed
in reverse topological order (from $j = n'$ to 1):

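$$p(\mathbf{u}_{C_j} | \mathbf{v}_{C_j}) = \sum_{\mathbf{X}_j^{last} \setminus \mathbf{X}'} \prod_{X_i \in \mathbf{X}_j^{proc}} p(x_i | \pi_{X_i}) \prod_{C_{j'} \in \Lambda_{C_j}} p(\mathbf{u}_{C_{j'}} | \mathbf{v}_{C_{j'}}), \quad (2)$$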

where $\mathbf{X}_j^{proc} = \{X_i \in C_j : (X_i \cup \Pi_{X_i}) \cap \mathbf{X}_j^{last} \neq \emptyset\}$ is the set of variables whose local probability
functions were not processed yet (but need to be) in order to sum out over the elements $\mathbf{X}_j^{last} \setminus \mathbf{X}'$. Furthermore,
$\mathbf{U}_{C_j} = (\mathbf{X}_j^{proc} \cup \bigcup_{C_{j'} \in \Lambda_{C_j}} \mathbf{U}_{C_{j'}}) \setminus \mathbf{X}_j^{last}$ is composed of elements of $C_j$ and descendants that are
also present in the parent $C_{j_p}$ and whose local probability distributions were already taken into account (they do appear in
the left side of the conditioning bar in $C_j$ or in its descendants), and finally $\mathbf{V}_{C_j} = C_j \setminus (\mathbf{U}_{C_j} \cup \mathbf{X}_j^{last})$ are the
variables that already appeared in the right side of the conditioning bar (but neither in the left side nor among the variables that were summed out).
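
To make the idea of summing in a convenient order concrete, the following sketch implements plain variable (bucket) elimination over binary variables. It is a simplified stand-in for the recursion above, not the tree-decomposition procedure itself; the factor representation as `(variables, table)` pairs and the chain A -> B -> C with its numbers are assumptions of the example.

```python
from itertools import product


def multiply(f, g):
    """Pointwise product of two factors over binary variables."""
    fv, ft = f
    gv, gt = g
    vars_ = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for vals in product((0, 1), repeat=len(vars_)):
        a = dict(zip(vars_, vals))
        table[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return vars_, table


def sum_out(f, var):
    """Marginalize `var` out of factor f."""
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for vals, p in ft.items():
        key = tuple(x for v, x in zip(fv, vals) if v != var)
        table[key] = table.get(key, 0.0) + p
    return keep, table


def eliminate(factors, order):
    """Sum out the variables in `order`; each is assumed to occur in a factor."""
    for var in order:
        touching = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        combined = touching[0]
        for f in touching[1:]:
            combined = multiply(combined, f)
        factors = rest + [sum_out(combined, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result


# Hypothetical chain A -> B -> C; the query is the marginal p(C).
p_a = (("A",), {(0,): 0.6, (1,): 0.4})
p_b_a = (("A", "B"), {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.7})
p_c_b = (("B", "C"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5})
p_c = eliminate([p_a, p_b_a, p_c_b], ["A", "B"])
```

Each elimination step only builds tables over the variables that share a factor with the eliminated one, which is the source of the savings with respect to enumerating every term of Eq. (1).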

The recursion of Eq. (2) formalizes the engine behind well known algorithms (for instance, bucket elimination) for
BU in BNs. Putting in words, the values $p(\mathbf{u}_{C_{j'}} | \mathbf{v}_{C_{j'}})$ represent the information that comes from the children of $C_j$.
They can be seen as functions over the domains $\Omega_{\mathbf{U}_{C_{j'}} \cup \mathbf{V}_{C_{j'}}}$ that come from independent subtrees and are multiplied
altogether and by the probability functions that appear for the first time in $C_j$ (if any). Then such functions are
summed out over the variables that do not appear in the parent of the current $C_j$ to build the information that will be
"propagated" to the parent $C_{j_p}$, that is, they are all used to construct the function over $\Omega_{\mathbf{U}_{C_j} \cup \mathbf{V}_{C_j}}$ defined by the values
$p(\mathbf{u}_{C_j} | \mathbf{v}_{C_j})$. The recursion is evaluated for each $j$, and finally $p(\mathbf{u}_{C_1}) = p(\mathbf{x}')$ (in this case $\mathbf{V}_{C_1}$ is certainly empty).
Note that $\mathbf{U}_{C_j} \cup \mathbf{V}_{C_j} \subseteq (C_j \cap \Pi_{C_j})$ and $\mathbf{U}_{C_j} \cap \mathbf{V}_{C_j} = \emptyset$. Because $p(\mathbf{u}_{C_j} | \mathbf{v}_{C_j})$ is evaluated for each
instantiation of $\mathbf{u}_{C_j} \in \Omega_{\mathbf{U}_{C_j}}$ and $\mathbf{v}_{C_j} \in \Omega_{\mathbf{V}_{C_j}}$ that agrees with $\mathbf{x}'$, there are
$z((\mathbf{U}_{C_j} \cup \mathbf{V}_{C_j}) \setminus \mathbf{X}') \le z((C_j \cap \Pi_{C_j}) \setminus \mathbf{X}')$ numbers to be computed. Each such computation requires at most
$z(\mathbf{X}_j^{last})$ (the summation has at most this number of terms) times at most $|\Lambda_{C_j}| + 1$ multiplications. Thus, the total
running time is at most

$$\sum_j (1 + |\Lambda_{C_j}|) \cdot z(\mathbf{U}_{C_j} \cup \mathbf{V}_{C_j}) \cdot z(\mathbf{X}_j^{last}) \le \sum_j (1 + |\Lambda_{C_j}|) \cdot z(C_j) \in O(n \cdot z_{\max}^{w}), \quad (3)$$

where $w = \max_j |C_j|$. We have used the fact that $\sum_j |\Lambda_{C_j}| < n'$, and $n' \in O(n)$. This computational time is
exponential in the size of the sets $C_j$ of the tree, so it is reasonable to look for a decomposition with small treewidth:

Definition 3 The treewidth of a graph $\mathcal{G}$ w.r.t. the tree decomposition $\mathcal{T} = (\mathbf{C}, T)$ is the maximum number of nodes
of $\mathcal{G}$ in a node of $T$ minus 1: $w(\mathcal{G}, \mathcal{T}) = -1 + \max_j |C_j|$. Moreover, $w^*(\mathcal{G}) = \min_{\mathcal{T}} w(\mathcal{G}, \mathcal{T})$, with $\mathcal{T}$ ranging over
all possible tree decompositions, is the minimum treewidth of a graph $\mathcal{G}$.
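
The width of a given decomposition follows directly from Definition 3. The sketch below also builds the bags $X_i \cup \Pi_{X_i}$ of a DAG, which give a width equal to the largest number of parents; treating these bags as a valid tree decomposition is only justified for polytrees, and the helper names are assumptions of this example.

```python
def decomposition_width(bags):
    """Width of a tree decomposition: size of the largest bag minus one."""
    return max(len(bag) for bag in bags) - 1


def family_bags(parents):
    """One bag per variable, holding the variable and its parents."""
    return [{child, *pa} for child, pa in parents.items()]


# Hypothetical binary polytree where no node has more than two parents.
polytree = {"X1": (), "X2": (), "Y1": ("X1",), "Y2": ("Y1", "X2")}
polytree_width = decomposition_width(family_bags(polytree))
```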

Finding the tree decomposition with minimum treewidth is an NP-complete problem [14]. The best known approximation
method achieves an $O(\log n)$ factor of the optimal [15]. In spite of that, some particular BNs deserve attention:
$\mathcal{N} = (\mathcal{G}, \mathcal{X}, \mathcal{P})$ is called a polytree if the subjacent graph$^5$ of $\mathcal{G}$ has no cycles. A polytree $\mathcal{N}$ is further called a tree if
each node of $\mathcal{G}$ has at most one parent. For trees and polytrees, Decision-BU is solvable in polynomial time [7]. As
we see in Eq. (3), this is also true for any network of bounded treewidth. In fact the moral graph of a polytree may
have a large treewidth if the number of parents of a variable is large. However, the input size would proportionally
increase too, and the running time would remain polynomial in the input size. We do not discuss this case further for ease of
exposition.



5 A subjacent graph of $\mathcal{G}$ is the graph obtained by dropping the direction of the arcs.

The MAP problem is to find an instantiation $\mathbf{x}' \in \Omega_{\mathbf{X}}$, with $\mathbf{X} \subseteq \mathcal{X} \setminus \mathbf{E}$, such that its probability is maximized,
that is,

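$$\mathbf{x}' = \arg\max_{\mathbf{x} \in \Omega_{\mathbf{X}}} p(\mathbf{x}|\mathbf{e}) = \arg\max_{\mathbf{x} \in \Omega_{\mathbf{X}}} \frac{p(\mathbf{x}, \mathbf{e})}{p(\mathbf{e})} = \arg\max_{\mathbf{x} \in \Omega_{\mathbf{X}}} p(\mathbf{x}, \mathbf{e}), \quad (4)$$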

because $p(\mathbf{e})$ (assumed to be non-zero) is a constant with respect to the maximization. It is known that MAP queries
are harder than BU queries (under the assumptions that P$\neq$NP and PP$\neq$NP^PP). It is proved that the general MAP
problem is NP^PP-complete [1]. However, such proof assumes a general case of the problem, while many practical
BNs have some structural properties that might alleviate the complexity. Two of them are very important with respect
to the complexity of the problem: the cardinality of the variables involved in the network and the minimum treewidth
of the moralized graph (which is expected, because this value also affects the BU complexity). The latter is considered
in [1], and the problem is shown to be NP-complete and not approximable by a polynomial approximation scheme.
We make use of a parametrized version of MAP to exploit these two characteristics.
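
For intuition about Eq. (4), a MAP query can be answered by brute force: for each candidate instantiation of the MAP variables, sum the joint over the remaining non-evidence variables and keep the best. The sketch below does exactly that on a hypothetical chain A -> B -> C; it takes exponential time and is not the algorithm proposed in the paper. All names and numbers are assumptions of the example.

```python
from itertools import product

# Hypothetical chain A -> B -> C with binary variables.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}
states = {"A": (0, 1), "B": (0, 1), "C": (0, 1)}


def joint(x):
    """Joint probability of a complete instantiation of the chain."""
    return p_a[x["A"]] * p_b_given_a[x["A"]][x["B"]] * p_c_given_b[x["B"]][x["C"]]


def brute_force_map(joint, states, map_vars, evidence):
    """Return (argmax_x, max_x) of p(x, e), summing out all other variables."""
    hidden = [v for v in states if v not in map_vars and v not in evidence]
    best_x, best_p = None, -1.0
    for xs in product(*(states[v] for v in map_vars)):
        assignment = {**evidence, **dict(zip(map_vars, xs))}
        total = 0.0
        for ys in product(*(states[v] for v in hidden)):
            assignment.update(zip(hidden, ys))
            total += joint(assignment)
        if total > best_p:
            best_x, best_p = dict(zip(map_vars, xs)), total
    return best_x, best_p


# MAP over A given the evidence C = 1; B is neither queried nor observed.
best_x, best_p = brute_force_map(joint, states, ["A"], {"C": 1})
```

The inner summation is a BU query and the outer loop is a maximization, which is the combination that makes MAP harder than BU.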

Definition 4 Given a BN $\mathcal{N} = (\mathcal{G}, \mathcal{X}, \mathcal{P})$ where $z$ is the maximum cardinality of any variable and $w^*(\mathcal{G}^m) = w$ is the
minimum treewidth of the moral graph of $\mathcal{G}$, $\mathbf{X} \subseteq \mathcal{X} \setminus \mathbf{E}$, a rational $r$ and an instantiation $\mathbf{e} \in \Omega_{\mathbf{E}}$, Decision-MAP-z-w
is the problem of deciding if there is $\mathbf{x} \in \Omega_{\mathbf{X}}$ such that $p(\mathbf{x}, \mathbf{e}) > r$. MAP-z-w is the respective optimization
version$^6$. Furthermore, we denote by Decision-MAP-$\infty$-w the MAP problem where $z$ can be asymptotically as large
as the input $\mathrm{size}(\mathcal{N})$ (this is the same as having no bound, as we know that $\mathrm{size}(\mathcal{N}) \in \Omega(z)$).

3. Complexity results 

The results of this section show that MAP remains NP-complete even when restricted to very simple networks.
Previously the hardness has been proved just for polytrees where the maximum cardinality of each variable was
$\Omega(\mathrm{size}(\mathcal{N}))$ [1] (more specifically, the proof showed that Decision-MAP-$\infty$-w is NP-hard for $w = 2$, because the
cardinality could be as large as the number of clauses in the SAT problem used in the reduction, and the number of
clauses of a SAT problem can be (asymptotically) as large as the corresponding input size. It was also shown that
Decision-MAP-z-$\infty$ is NP-hard for $z = 2$ but $w$ unbounded). Here we show that the problem remains hard even when
restricted to:

• Simple binary polytrees with at most two parents per node (which directly strengthens previous results). 

• Trees with no bound on maximum cardinality but network topology as simple as a Naive Bayes structure [16]
(in fact such version of the problem does not admit a polynomial approximation scheme, as we show that the
optimization version is equivalent to the optimization version of MAXSAT [1]).

• Trees with bounded maximum cardinality and network topology as simple as a Hidden Markov Model structure
[17] (this result shows that Decision-MAP is hard even in trees with bounded cardinality per variable).



Altogether these new complexity proofs strongly indicate that MAP problems are hard even when the underlying
structure of the BN is very simple. First, Theorem 5 states the well-known fact that MAP is within NP when $w$ is
fixed. Then Theorems 6, 8 and 10 present the hardness results.

Theorem 5 Decision-MAP-z-w is in NP for any fixed $w$.

Proof Membership in NP is trivial because Eq. (3) is polynomial in $\mathrm{size}(\mathcal{N})$ if the minimum treewidth is at most $w$
(there is a linear time algorithm to find a tree decomposition of minimum treewidth when $w$ is fixed [12]). So, given
an instantiation $\mathbf{x}$, we can check whether $p(\mathbf{x}, \mathbf{e}) > r$ (or even $p(\mathbf{x}|\mathbf{e}) > r$) by Eq. (2) in polynomial time. □

Theorem 6 Decision-MAP-z-w is NP-hard even if $z = w = 2$.



6 In fact the results of this paper hold in both the unconditional (as presented in Def. 4) and the conditional formulation $p(\mathbf{x}|\mathbf{e}) > r$ for the
Decision-MAP-z-w problem, where the evidence is treated as conditioning information, because we deal with cases of bounded $w$ and computing
$p(\mathbf{e})$ is a polynomial-time task.



Proof Hardness is shown using a reduction from the partition problem, which is NP-hard [3] and can be stated as follows:
given a set of $m$ positive integers $s_1, \ldots, s_m$, is there a set $I \subset A = \{1, \ldots, m\}$ such that $\sum_{i \in I} s_i = \sum_{i \in A \setminus I} s_i$?
All the input is encoded using $b > 0$ bits.

Denote $S = \frac{1}{2} \sum_{i \in A} s_i$ and call even partition a subset $I \subset A$ that achieves $\sum_{i \in I} s_i = S$. To solve partition, we
consider the rescaled problem (dividing every element by $S$), so that $v_i = \frac{s_i}{S} \le 2$ are the elements and we look for a
partition with sum equal to 1 (altogether the elements sum to 2).

We construct (in polynomial time) a binary polytree (so $z_{\max} = 2$) with $3m + 1$ nodes where the maximum
number of parents of a node is 2, which implies that there is a tree decomposition of the moral graph with treewidth
$w = 2$ (to see that, just take the same polytree and define the nodes $C_i$ containing $X_i \cup \Pi_{X_i}$). The binary nodes
are $\mathbf{X} = \{X_1, \ldots, X_m\}$, $\mathbf{Y} = \{Y_0, Y_1, \ldots, Y_m\}$ and $\mathbf{E} = \{E_1, \ldots, E_m\}$. We denote by $\{x_i^T, x_i^F\}$ the states
of $X_i$ (similarly for $Y_i$ and $E_i$). The structure of the network is presented in Figure 1. Each $X_i \in \mathbf{X}$ has no
parents and uniform distribution, each $E_i$ has $X_i$ as sole parent, with probability values defined as $p(e_i^T | x_i^F) = 1$ and
$p(e_i^T | x_i^T) = t_i$ (the values for $e_i^F$ complement those to sum one), where $t_i$ is obtained by evaluating $2^{-v_i}$ up to $4b + 3$
bits of precision (and rounded up if necessary), that is, $t_i = 2^{-v_i} + \mathrm{error}_i$, where $0 \le \mathrm{error}_i < 2^{-(4b+3)}$. Clearly $t_i$
can be computed in polynomial time and space in $b$ (this ensures that the specification of the Bayesian network, which
requires rational numbers, is polynomial in $b$). Furthermore, note that $2^{-v_i} \le t_i \le 2^{-v_i} + \mathrm{error}_i < 2^{-v_i + 2^{-4b}}$ (by
Corollary 16, see appendix for details).




Figure 1: Network structure for the proof of Theorem 6.

$Y_0$ has no parents and $p(y_0^T) = 1$. For the nodes $Y_i \in \mathbf{Y}$, $1 \le i \le m$, the parents are $X_i$ and $Y_{i-1}$, and the
probability values are $p(y_i^T | y_{i-1}^T, x_i^T) = t_i$, $p(y_i^T | y_{i-1}^T, x_i^F) = 1$, and $p(y_i^T | y_{i-1}^F, x_i) = 0$ for both states $x_i \in \Omega_{X_i}$.

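Note that with this specification and given the Markov condition of the network, we have (for any given $\mathbf{x} \in \Omega_{\mathbf{X}}$)
$p(y_m^T | \mathbf{x}) = p(\mathbf{e}^T | \mathbf{x}) = \prod_{i \in I} t_i$, where $I \subseteq A$ is the indicator of the elements such that $X_i$ is at the state $x_i^T$. Denote
$t = \prod_{i \in I} t_i$. Then

$$p(\mathbf{x}, \mathbf{e}^T, \neg y_m^T) = p(\neg y_m^T | \mathbf{x}) \, p(\mathbf{x}, \mathbf{e}^T) = p(\mathbf{x}) \, p(\mathbf{e}^T | \mathbf{x}) \left(1 - p(y_m^T | \mathbf{x})\right) = \frac{1}{2^m} t (1 - t). \quad (5)$$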

This is a concave quadratic function on $t$ with maximum at $2^{-1}$. Moreover, the value of $t(1 - t)$ monotonically
increases when $t$ approaches one half (from both sides). For a moment, suppose that $t_i$ (defined in the previous
paragraph) is exactly $2^{-v_i}$ (instead of an approximation of it as described). In this case,

$$\frac{1}{2^m} t (1 - t) = \frac{1}{2^m} 2^{-\sum_{i \in I} v_i} \left(1 - 2^{-\sum_{i \in I} v_i}\right),$$

which achieves the maximum of $\frac{1}{2^m} 2^{-1} (1 - 2^{-1})$ if and only if $\sum_{i \in I} v_i = 1$, which means that there is an even
partition. This proof would be ended here if there was not the following consideration: we must show that the
transformation is computed in polynomial time and the parameters of a BN are rational numbers, and computing $2^{-v_i}$
(needed to define the BN) might be an issue. For such purpose we employ an approximate version of $2^{-v_i}$ to define
each $t_i$. The remainder of this proof addresses the question of how the numerical errors introduced in the definition of
values $t_i$ interfere in the main result. Hence, note that if $I$ is not an even partition, then we know that one of the two
conditions hold: (i) $\sum_{i \in I} s_i \le S - 1 \Rightarrow \sum_{i \in I} v_i \le 1 - \frac{1}{S}$, or (ii) $\sum_{i \in I} s_i \ge S + 1 \Rightarrow \sum_{i \in I} v_i \ge 1 + \frac{1}{S}$, because
the original numbers $s_i$ are integers. Consider these two cases.
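If $\sum_{i \in I} s_i \ge S + 1$, then

$$t < \prod_{i \in I} 2^{-v_i + 2^{-4b}} = 2^{\sum_{i \in I} (-v_i + 2^{-4b})} \le 2^{m 2^{-4b} - (1 + \frac{1}{S})} \le 2^{-1 - \left(\frac{1}{2^b} - \frac{1}{2^{3b}}\right)} = l,$$

by using $S \le 2^b$ and $m \le b \le 2^b$. On the other hand, if $\sum_{i \in I} s_i \le S - 1$, then

$$t \ge \prod_{i \in I} 2^{-v_i} = 2^{-\sum_{i \in I} v_i} \ge 2^{-(1 - \frac{1}{S})} = 2^{-1 + \frac{1}{S}} \ge 2^{-1 + \frac{1}{2^b}} = u.$$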

Now suppose $I'$ is an even partition. Then we know that the corresponding $t'$ is

$$2^{-1} \le t' < \prod_{i \in I'} 2^{-v_i + 2^{-4b}} = 2^{\sum_{i \in I'} (-v_i + 2^{-4b})} \le 2^{-1 + \frac{1}{2^{3b}}} = a.$$

The distance of $t'$ to $2^{-1}$, even after adding a gap of $2^{-(3b+2)}$, is always less than the distance to $2^{-1}$ of the value $t$ of any non even partition:

$$|t' - 2^{-1}| + 2^{-(3b+2)} < a - 2^{-1} + 2^{-(3b+2)} < \min\{u - 2^{-1}, 2^{-1} - l\} \le |t - 2^{-1}|, \quad (6)$$

which is proved by analyzing the two elements of the minimization. The first term holds because

$$a + 2^{-(3b+2)} - 2^{-1} < a \cdot 2^{\frac{1}{2^{3b}}} - 2^{-1} = 2^{-1 + \frac{2}{2^{3b}}} - 2^{-1} < 2^{-1 + \frac{1}{2^b}} - 2^{-1} = u - 2^{-1}.$$

The second comes from the fact that the function $h(b) = a + l + 2^{-(3b+2)} = 2^{-1 + \frac{1}{2^{3b}}} + 2^{-1 - \left(\frac{1}{2^b} - \frac{1}{2^{3b}}\right)} + 2^{-(3b+2)}$
is less than 1 for $b = 1, 2$ (by inspection), it is a monotonic increasing function for $b \ge 2$ (the derivative is always
positive), and it has $\lim_{b \to \infty} h(b) = 1$. Hence, we conclude that $h(b) < 1$, which implies

$$a + l + 2^{-(3b+2)} < 1 \iff a - 2^{-1} + 2^{-(3b+2)} < 2^{-1} - l.$$

This shows that there is a gap of at least $2^{-(3b+2)}$ between the worst value of $t'$ (relative to an even partition)
and the best value of $t$ (relative to a non even partition), which will be used next to specify the threshold of the MAP
problem. Now, set up $\mathbf{X}$ to be the MAP variables and $\mathbf{E} = \mathbf{e}^T$ and $Y_m = \neg y_m^T$ to be the evidence, so that we verify if

$$\max_{\mathbf{x}} p(\mathbf{x}, \mathbf{e}^T, \neg y_m^T) > r = c \cdot \frac{1}{2^m}, \quad (7)$$

where $c$ equals $a' \cdot (1 - a')$, and $a'$ equals $a$ evaluated up to $3b + 2$ bits and rounded up, which implies that
$2^{-1} < a \le a' < a + 2^{-(3b+2)}$.$^7$ By Eq. (6), $a'$ is closer to one half than any $t$ of a non even partition, so the value $c$ is certainly
greater than any value that would be obtained by a non even partition. On the other hand, $a'$ is farther from $2^{-1}$ than
$a$, so we can conclude that

$$t \cdot (1 - t) < c \le a \cdot (1 - a) < t' \cdot (1 - t')$$

for any $t$ corresponding to a non-even partition and any $t'$ of an even partition. Thus, a solution of the MAP problem
obtains $p(\mathbf{x}, \mathbf{e}^T, \neg y_m^T) > r$ if and only if there is an even partition. □
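
The two numerical facts behind this proof can be checked on small cases. The sketch below uses the idealized values $t_i = 2^{-v_i}$ (not the rounded rationals of the actual reduction) to confirm that $t(1-t)$ reaches $1/4$ exactly when an even partition exists, and it evaluates the gap inequality of Eq. (6) for a range of $b$. The function names and test instances are assumptions of the example, and floating-point arithmetic is used, so this is an illustration rather than a proof.

```python
from itertools import combinations


def max_objective(s):
    """Max over subsets I of t(1 - t), with t = prod_{i in I} 2^{-s_i / S}."""
    S = sum(s) / 2
    best = 0.0
    for r in range(len(s) + 1):
        for I in combinations(range(len(s)), r):
            t = 2.0 ** (-sum(s[i] for i in I) / S)
            best = max(best, t * (1 - t))
    return best


def gap_holds(b):
    """Check a - 1/2 + 2^{-(3b+2)} < min(u - 1/2, 1/2 - l) for a given b."""
    a = 2.0 ** (-1 + 2.0 ** (-3 * b))
    l = 2.0 ** (-1 - (2.0 ** (-b) - 2.0 ** (-3 * b)))
    u = 2.0 ** (-1 + 2.0 ** (-b))
    gap = 2.0 ** (-(3 * b + 2))
    return a - 0.5 + gap < min(u - 0.5, 0.5 - l)


with_partition = max_objective([3, 1, 1, 2, 2, 1])   # 3 + 2 = 1 + 1 + 2 + 1
without_partition = max_objective([1, 1, 4])         # no subset sums to 3
gap_checks = [gap_holds(b) for b in range(1, 13)]
```

Only small values of $b$ are tested because the quantities involved quickly approach the resolution of double-precision floats.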

Corollary 7 Decision-MAP is NP-complete when restricted to binary polytrees.

Proof It follows directly from Theorems 5 and 6. □

The next theorem shows that the problem remains hard even in trees. The tree used for the proof is probably
the simplest practical tree: a Naive Bayes structure [16], where there is a node called "class" with direct children
called "features". These features are independent of each other given the class. The theorem can be easily formulated
using other trees, such as a Hidden Markov Model topology [17], by simply replicating the node corresponding to the
class. One strong characteristic of the following result is that the reduction is done using the maximum-satisfiability
problem. The same problem was employed before to show that MAP is hard in polytrees [1]. Hence, we show here
that the inapproximability results for MAP [1] when the maximum cardinality is not bounded extend to the subcase
of trees.



7 The conditional version of Decision-MAP could be used in the reduction too by including the term $\frac{1}{p(\mathbf{e}^T, \neg y_m^T)}$ in $r$, which can be computed
in polynomial time by a BN propagation in polytrees [7], and it does not depend on the choice $\mathbf{x}$.

Theorem 8 Decision-MAP-$\infty$-w is NP-hard even if $w = 1$ and the network topology is as simple as a Naive Bayes
structure.

Proof We use a reduction from MAX-2-SAT. Let $X_1, \ldots, X_m$ be variables of a SAT problem with clauses $C_1, \ldots, C_{m'}$
written in 2CNF, that is, each clause is composed of a disjunction of two literals. Each literal belongs to
$\Omega_{X_j} = \{x_j, \neg x_j\}$ for a given $j$. Without loss of generality, we assume that each clause involves exactly two distinct variables
of $X_1, \ldots, X_m$. Let $b > 0$ be the number of bits to specify the MAX-2-SAT problem.




Figure 2: Network structure for the proof of Theorem 8.

Take a Naive Bayes shaped network. Let $C$ be the root of the Naive Bayes structure and $Y_1, \ldots, Y_m$ the binary
features (as shown in Figure 2) such that $\Omega_{Y_j} = \{y_j^T, y_j^F\}$ for every $j$. Define the variable $C$ to have $2m'$ states
and uniform prior, that is, $p(c) = \frac{1}{2m'}$ for every $c \in \Omega_C$, where $\Omega_C$ equals $\{c_{1L}, c_{1R}, c_{2L}, c_{2R}, \ldots, c_{m'L}, c_{m'R}\}$.
Denote by $L_i$ the literal of clause $C_i$ with the smallest index, and by $R_i$ the literal with the greatest index. Define the
conditional probability functions of each $Y_j$ given $C$ as follows:

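$p(y_j^T | c_{iL}) = p(y_j^T | c_{iR}) = \frac{1}{2}$ if $L_i, R_i \notin \Omega_{X_j}$, that is, $X_j$ does not appear in clause $C_i$.

$p(y_j^T | c_{iR}) = 1$ if $(x_j = R_i) \lor (\neg x_j = L_i)$.
$p(y_j^T | c_{iR}) = 0$ if $(x_j = L_i) \lor (\neg x_j = R_i)$.

$p(y_j^T | c_{iL}) = 1$ if $x_j = L_i$.
$p(y_j^T | c_{iL}) = 0$ if $\neg x_j = L_i$.

$p(y_j^T | c_{iL}) = \frac{1}{2}$ if $R_i \in \Omega_{X_j}$.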

Define $Y_0$ as an extra feature such that $p(y_0^T | c_{iL}) = 1$ and $p(y_0^T | c_{iR}) = 1/2$ for every $i$. The probability values for
$y_j^F$ complement these numbers in order to sum one, that is, $p(y_j^F | c) = 1 - p(y_j^T | c)$ for every $c \in \Omega_C$. Now,

$$\max_{y_0, \ldots, y_m} p(y_0, y_1, \ldots, y_m) = \max_{y_1, \ldots, y_m} p(y_0^T, y_1, \ldots, y_m),$$

because the vector $p(y_0^T | C)$ (note that the vector ranges over the states of $C$, and $y_0^T$ is fixed) pareto-dominates the
vector $p(y_0^F | C)$ by construction, that is, $p(y_0^T | c) \ge p(y_0^F | c)$ for every $c \in \Omega_C$ (more details on pareto sets will follow
in Section 4, but here it is enough to see that there is no reason to choose $y_0^F$ in place of $y_0^T$ as the probability value of
the latter is never smaller than that of the former for every given $c$).$^8$ It is clear that the transformation is polynomial
in $b$, as the network has $m + 1$ nodes, with at most $2m'$ states (both $m$ and $m'$ are $O(b)$), and the probability values
are always 0, 1/2 or 1.

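By simple manipulations, we have (for any given $y_1, \ldots, y_m$, with $y_0 = y_0^T$ and products ranging over $j = 0, \ldots, m$):

$$p(y_0^T, y_1, \ldots, y_m) = \sum_i \Big( p(c_{iL}) \prod_j p(y_j | c_{iL}) + p(c_{iR}) \prod_j p(y_j | c_{iR}) \Big) = \frac{1}{2m'} \sum_i \Big( \prod_j p(y_j | c_{iL}) + \prod_j p(y_j | c_{iR}) \Big)$$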



8 Another approach would force $Y_0 = y_0^T$ as evidence instead of using this pareto argument, which would also suffice to prove the theorem.

$$= \frac{1}{2m' \, 2^{m-2}} \sum_i \Big( p(y_{j_{iL}} | c_{iL}) \, p(y_{j_{iR}} | c_{iL}) \, p(y_0^T | c_{iL}) + p(y_{j_{iL}} | c_{iR}) \, p(y_{j_{iR}} | c_{iR}) \, p(y_0^T | c_{iR}) \Big),$$

where $j_{iL}$ and $j_{iR}$, with $j_{iL} < j_{iR}$, are the indices of the two variables that occur in clause $C_i$ (the probabilities of all
other variables $Y_j$ that appeared in the product have led to the fraction $\frac{1}{2}$ because they do not occur in $C_i$ and hence
disappeared to form the constant $\frac{1}{2^{m-2}}$ that has been put outside the summation). Yet by construction, we continue
with

$$p(y_0^T, y_1, \ldots, y_m) = \frac{1}{2m' \, 2^{m-2}} \sum_i \Big( p(y_{j_{iL}} | c_{iL}) \frac{1}{2} + p(y_{j_{iL}} | c_{iR}) \, p(y_{j_{iR}} | c_{iR}) \frac{1}{2} \Big)$$
$$= \frac{1}{2m' \, 2^{m-1}} \sum_i \Big( p(y_{j_{iL}} | c_{iL}) + p(y_{j_{iL}} | c_{iR}) \, p(y_{j_{iR}} | c_{iR}) \Big).$$

Now note that $p(y_{j_{iL}} | c_{iL})$ and $p(y_{j_{iL}} | c_{iR})$ are mutually exclusive, and that $p(y_{j_{iL}} | c_{iL}) = 1$ if and only if $L_i$
satisfies clause $C_i$. In this case, $p(y_{j_{iL}} | c_{iR}) = 0$, and the sum $p(y_{j_{iL}} | c_{iL}) + p(y_{j_{iL}} | c_{iR}) \, p(y_{j_{iR}} | c_{iR})$ equals
1. On the other hand, if $L_i$ does not make clause $C_i$ satisfiable, then $p(y_{j_{iL}} | c_{iL}) + p(y_{j_{iL}} | c_{iR}) \, p(y_{j_{iR}} | c_{iR}) =
p(y_{j_{iR}} | c_{iR})$, that is, it becomes one if and only if $R_i$ makes clause $C_i$ satisfiable. Because we sum over all the
clauses, $p(y_0^T, y_1, \ldots, y_m) = \frac{k}{2^m m'}$ if and only if $k$ clauses are satisfied in the 2CNF formula by the assignment
$y_1, \ldots, y_m$. Hence, solving the optimization $\max_{y_1, \ldots, y_m} p(y_0^T, y_1, \ldots, y_m)$ is the same as solving MAX-2-SAT. Because the
optimization versions agree as described, the reduction of the decision version follows too. □
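
The identity $p(y_0^T, y_1, \ldots, y_m) = \frac{k}{2^m m'}$ can be verified exhaustively on a small formula. The sketch below encodes the conditional probabilities of the reduction with exact fractions; the literal encoding `(variable index, sign)` and the sample formula are assumptions of the example.

```python
from itertools import product
from fractions import Fraction

HALF = Fraction(1, 2)


def p_feature(j, y, clause, side):
    """p(Y_j = y | c_{i,side}) for clause = (L_i, R_i), literal = (index, sign)."""
    (lv, ls), (rv, rs) = clause
    if j not in (lv, rv):
        return HALF                      # X_j does not appear in the clause
    if side == "L":
        if j == rv:
            return HALF                  # c_iL ignores the right literal
        p_true = 1 if ls else 0          # y_j^T has probability 1 iff L_i = x_j
    elif j == lv:
        p_true = 0 if ls else 1          # c_iR requires L_i to be falsified
    else:
        p_true = 1 if rs else 0          # and R_i to be satisfied
    return Fraction(p_true if y else 1 - p_true)


def map_objective(y, clauses, m):
    """p(y_0^T, y_1, ..., y_m) in the Naive Bayes network of the reduction."""
    prior = Fraction(1, 2 * len(clauses))
    total = Fraction(0)
    for clause in clauses:
        for side, p_y0 in (("L", Fraction(1)), ("R", HALF)):
            term = prior * p_y0
            for j in range(1, m + 1):
                term *= p_feature(j, y[j], clause, side)
            total += term
    return total


def satisfied(y, clauses):
    """Number of clauses satisfied by the truth assignment y."""
    return sum(1 for (lv, ls), (rv, rs) in clauses if y[lv] == ls or y[rv] == rs)


# (x1 or x2), (not x1 or x3), (not x2 or not x3), (x1 or not x3)
m = 3
clauses = [((1, True), (2, True)), ((1, False), (3, True)),
           ((2, False), (3, False)), ((1, True), (3, False))]
agrees = all(
    map_objective(dict(zip(range(1, m + 1), vals)), clauses, m)
    == Fraction(satisfied(dict(zip(range(1, m + 1), vals)), clauses),
                2 ** m * len(clauses))
    for vals in product((True, False), repeat=m)
)
```

Exact rational arithmetic is used so that the equality test is not affected by rounding.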

We show next a stronger inapproximability result than those previously stated in the literature, because we make
use of trees while previous results make use of polytrees (or more sophisticated topologies). Recall that an approximation
algorithm for a maximization problem where the exact maximum value is $M > 0$ is said to achieve a ratio
$r^\circ > 1$ from the optimal if the resulting value is guaranteed to be greater than or equal to $\frac{M}{r^\circ}$. We demonstrate that
approximating Decision-MAP is NP-hard even if the network topology is as simple as a tree. This leaves no hope of
approximating MAP in polynomial time when the number of states per variable is not bounded.

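Theorem 9 It is NP-hard to approximate Decision-MAP-$\infty$-w, with $w = 1$, to any ratio $r^\circ = \mathrm{size}(\mathcal{N})^{\varepsilon}$ for fixed $\varepsilon$,
as well as to any ratio $r^\circ = 2^{\mathrm{size}(\mathcal{N})^{\varepsilon}}$, for fixed $0 < \varepsilon < 1$.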

Proof We show that it is possible to reduce MAX-2-SAT to the approximate version of Decision-MAP-$\infty$-1 in polynomial
time and space in $\mathrm{size}(\mathcal{N})$. The idea is similar to the repeated construction used in [1]. We build $q$ copies of
the network of Theorem 8 (superscripts are added to the variables to distinguish the copies as follows: the nodes of
the $t$-th copy are named $C^t, Y_0^t, \ldots, Y_m^t$) and link them by a common binary parent $D$ of all the $C^t$ nodes (as shown
in Fig. 3), with states $\{d^T, d^F\}$. We define $p(d^T) = 1$ and $p(c^t | d^T)$ remains uniform as before, for every node $C^t$.
By construction, we have

$$p(y_0^1, \ldots, y_m^1, \ldots, y_0^q, \ldots, y_m^q) = \sum_d p(d) \prod_{t=1}^{q} p(y_0^t, y_1^t, \ldots, y_m^t | d) = \prod_{t=1}^{q} p(y_0^t, y_1^t, \ldots, y_m^t | d^T),$$

and hence each copy has independent computations given $d^T$. Using the same argument as in Theorem 8 for each
copy, we obtain

$$p(y_0^1, \ldots, y_m^1, \ldots, y_0^q, \ldots, y_m^q) = \prod_{t=1}^{q} p(y_0^t, \ldots, y_m^t) = \prod_{t=1}^{q} \frac{k}{2^m m'} = \left(\frac{k}{2^m m'}\right)^q$$

if and only if $k$ clauses are satisfiable in the MAX-2-SAT problem. Suppose we want to decide if at least $1 < k' \le m'$
clauses are satisfiable (the restriction of $k' > 1$ does not lose generality). Using the approximation over this new
network with ratio $r^\circ$, if at least $k'$ clauses are satisfiable, then we must have

$$p(y_0^1, \ldots, y_m^1, \ldots, y_0^q, \ldots, y_m^q) \ge \frac{1}{r^\circ} \left(\frac{k'}{2^m m'}\right)^q.$$




Figure 3: Network structure for the proof of Theorem 9.



On the other hand, if it is not possible to satisfy $k'$ clauses, then we know that

$$p(y_0^1, \ldots, y_m^1, \ldots, y_0^q, \ldots, y_m^q) \le \left(\frac{k' - 1}{2^m m'}\right)^q.$$



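Now we need to show that it is possible to pick $q$ such that $\left(\frac{k' - 1}{2^m m'}\right)^q < \frac{1}{r^\circ} \left(\frac{k'}{2^m m'}\right)^q$ and such that $q$ is polynomially
bounded.

$$\left(\frac{k' - 1}{2^m m'}\right)^q < \frac{1}{r^\circ} \left(\frac{k'}{2^m m'}\right)^q \iff r^\circ < \left(\frac{k'}{k' - 1}\right)^q \iff q > \frac{\log(r^\circ)}{\log\left(\frac{k'}{k' - 1}\right)}.$$

Now, by Taylor expansion, $\log\left(\frac{k'}{k' - 1}\right) > \frac{1}{k'}$ (for any $k' > 1$) and thus

$$q > k' \log(r^\circ) \Rightarrow q > \frac{\log(r^\circ)}{\log\left(\frac{k'}{k' - 1}\right)} \Rightarrow \left(\frac{k' - 1}{2^m m'}\right)^q < \frac{1}{r^\circ} \left(\frac{k'}{2^m m'}\right)^q.$$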

The proof concludes by choosing the appropriate $q$. In the case of $r^\circ = \mathrm{size}(\mathcal{N})^{\varepsilon}$, we can choose a $q > 3$ such that

$$\frac{q}{\log(q)} > \varepsilon k' (1 + \log(f'(b))) \Rightarrow q > \varepsilon k' (\log(q) + \log(f'(b)) \log(q)) \Rightarrow q > \varepsilon k' (\log(q) + \log(f'(b)))$$
$$\Rightarrow q > \varepsilon k' \log(q \cdot f'(b)) \Rightarrow q > k' \log((q \cdot f'(b))^{\varepsilon}) \Rightarrow q > k' \log(\mathrm{size}(\mathcal{N})^{\varepsilon}) \Rightarrow q > k' \log(r^\circ),$$

where $f'$ is a polynomial that bounds the size of each copy, given by the construction in Theorem 8, and hence $q$ can
be chosen such as to ensure the polynomial transformation in the input size ($\frac{q}{\log(q)}$ is monotonically increasing for
$q \ge 3$).

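In the case of $r^\circ = 2^{\mathrm{size}(\mathcal{N})^{\varepsilon}}$, we can choose a $q$ such that

$$q > \left(\log(2)\,k' f'(b)^{\varepsilon}\right)^{\frac{1}{1-\varepsilon}} \implies q^{1-\varepsilon} > \log(2)\,k' f'(b)^{\varepsilon} \implies q > \log(2)\,k' q^{\varepsilon}\cdot f'(b)^{\varepsilon}$$

$$\implies q > \log(2)\,k'(q\cdot f'(b))^{\varepsilon} \implies q > k'\log\left(2^{(q\cdot f'(b))^{\varepsilon}}\right) \implies q > k'\log\left(2^{\mathrm{size}(\mathcal{N})^{\varepsilon}}\right) \implies q > k'\log(r^\circ),$$

and again the choice of $q$ is polynomial in the input size. □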
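
The choice of $q$ can be checked numerically. The following is a minimal sketch, not part of the original proof: the function names `separating_q` and `gap_is_separated` are hypothetical, and the test values of $k'$ and of the ratio are arbitrary. It picks the smallest integer $q > k'\log(r^\circ)$ and verifies, in log space, that $r^\circ < (k'/(k'-1))^q$ then holds.

```python
import math


def separating_q(k_prime: int, ratio: float) -> int:
    """Smallest integer q with q > k' * log(ratio), the sufficient bound from the proof."""
    return math.floor(k_prime * math.log(ratio)) + 1


def gap_is_separated(k_prime: int, ratio: float, q: int) -> bool:
    """Check ratio < (k'/(k'-1))**q, compared in log space to avoid overflow."""
    return math.log(ratio) < q * math.log(k_prime / (k_prime - 1))


# Arbitrary illustrative values: every chosen q separates the two cases.
for k_prime in (2, 5, 50, 1000):
    for ratio in (1.01, 2.0, 1e6):
        assert gap_is_separated(k_prime, ratio, separating_q(k_prime, ratio))
```

The value returned by `separating_q` grows only linearly in $k'$ and logarithmically in the ratio, which is the reason the number of copies stays polynomial in the argument above.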

To conclude this section about hardness results, we show that MAP remains hard even in a tree with bounded 
maximum cardinality. However, we demonstrate in the next sections that many practical problems can be solved 
exactly, and that fully polynomial approximation schemes are possible when cardinality and treewidth are bounded. 



Theorem 10 Decision-MAP-z-w is NP-hard even if z = 5 and w = 1 and the network topology is as simple as a Hidden Markov Model structure.

Proof This proof uses the same problem of the proof of Theorem 6 to perform the reduction, as well as the idea of approximating exponentials to guarantee the polynomial time reduction and the rationality of the numbers. Thus, hardness is shown using a reduction from the partition problem, which is NP-hard [3] and can be stated as follows: given a set of $m$ positive integers $s_1,\dots,s_m$, is there a set $I \subset A = \{1,\dots,m\}$ such that $\sum_{i\in I} s_i = \sum_{i\in A\setminus I} s_i$? All the input is encoded using $b > 0$ bits. Furthermore, we assume that $S = \frac{1}{2}\sum_{i\in A} s_i \ge 2$ and we call even partition a subset $I \subset A$ that achieves $\sum_{i\in I} s_i = S$. To solve partition, we consider the rescaled problem (dividing every element by $S$), so that $v_i = \frac{s_i}{S} \le 2$ are the elements and we look for a partition with sum equal to 1 (altogether the elements sum to 2).




Figure 4: Network structure for the proof of Theorem 10.

We construct (in polynomial time) a tree with $3m + 1$ nodes: $X = \{X_1,\dots,X_m\}$, $Y = \{Y_1,\dots,Y_m\}$ and $D = \{D_0, D_1,\dots,D_m\}$, such that $\Omega_{X_i} = \{x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}\}$ has 5 states, $\Omega_{Y_i} = \{y_i^T, y_i^F\}$ is binary, and $\Omega_{D_i} = \{d_i^T, d_i^F, d_i^*\}$ is ternary. The structure of the network is presented in Figure 4. The probability functions are defined by Table 1 (except for the probability function of $D_0$, which is uniform).

First, we show that $p(y) = p(y_1,\dots,y_m) = \frac{1}{2^m}$ for any configuration $y \in \Omega_Y$. By construction, we have $p(y_i \mid d_{i-1}) = \sum_{x_i} p(y_i \mid x_i)\,p(x_i \mid d_{i-1}) = \frac{1}{2}$ for any value of $y_i, d_{i-1}$. Now,



$$p(y_1,\dots,y_m) = \sum_{d_{m-1}} p(y_m \mid d_{m-1})\,p(d_{m-1}, y_1,\dots,y_{m-1}) = \frac{1}{2}\sum_{d_{m-1}} p(d_{m-1}, y_1,\dots,y_{m-1}) = \frac{1}{2}\,p(y_1,\dots,y_{m-1}).$$

Applying the same idea successively, we obtain $p(y_1,\dots,y_m) = \frac{1}{2^m}$. Using a similar procedure, we can obtain the values for $p(d_m^T, y)$ and $p(d_m^F, y)$ as follows:

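$$p(d_m^T, y) = p(d_m^T, y_1,\dots,y_m) = \sum_{x_m, d_{m-1}} p(d_m^T, y_m \mid x_m)\,p(x_m \mid d_{m-1})\,p(d_{m-1}, y_1,\dots,y_{m-1}) = \begin{cases} t_m\cdot\frac{1}{2}\cdot p(d_{m-1}^T, y_1,\dots,y_{m-1}) & \text{if } y_m = y_m^T,\\ 1\cdot\frac{1}{2}\cdot p(d_{m-1}^T, y_1,\dots,y_{m-1}) & \text{if } y_m = y_m^F,\end{cases}$$

and applying successively again, we get $p(d_m^T, y_1,\dots,y_m) = \frac{1}{3}\frac{1}{2^m}\prod_{i\in I} t_i$, where $I \subseteq A$ is the indicator of the elements such that $Y_i$ is at the state $y_i^T$ (the ratio $\frac{1}{3}$ comes from the uniform $p(D_0)$).

$$p(d_m^F, y) = p(d_m^F, y_1,\dots,y_m) = \sum_{x_m, d_{m-1}} p(d_m^F, y_m \mid x_m)\,p(x_m \mid d_{m-1})\,p(d_{m-1}, y_1,\dots,y_{m-1}) = \begin{cases} 1\cdot\frac{1}{2}\cdot p(d_{m-1}^F, y_1,\dots,y_{m-1}) & \text{if } y_m = y_m^T,\\ t_m\cdot\frac{1}{2}\cdot p(d_{m-1}^F, y_1,\dots,y_{m-1}) & \text{if } y_m = y_m^F,\end{cases}$$

and hence $p(d_m^F, y_1,\dots,y_m) = \frac{1}{3}\frac{1}{2^m}\prod_{i\in A\setminus I} t_i$. Because of that, we have

$$\max_y p(d_m^*, y) = \max_y\left(p(y) - p(d_m^T, y) - p(d_m^F, y)\right) = \frac{1}{2^m} - \min_y\left(p(d_m^T, y) + p(d_m^F, y)\right) = \frac{1}{2^m}\left(1 - \frac{1}{3}\min_y\left(\prod_{i\in I} t_i + \prod_{i\in A\setminus I} t_i\right)\right).$$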



p(Xi\Xi) 



Table 1: Probability functions used in the proof of Theorem 10.



%il *^i2 %i3 %i4 &i§ 



1/2 
1/2 



p(**IA-i) 


Ci 


df-i 


ft* 
a i-l 


Xil 


1/2 








x a 





1/2 





xa 





1/2 





X.j4 


1/2 








•%ib 








1 



p(Di\Xi) 


X t l 


Xil 


Xt3 


Xii 


3?i5 


d{' 


U 








1 





df 





1 


u 








d* 


l-U 





l-U 





1 



Table 2: Joint probability function of $Y_i$, $D_i$ given $X_i$ used in the proof of Theorem 10.



pO^AI-Xi) 


Xil 


Xi2 


x i3 


XiA 


%i5 


tf',# 


U 














tff.df 





1 











1/f.dJ 


l-<< 











1/2 













1 





yf,df 








u 








y[,dt 








l-U 





1/2 



For the sake of simplicity, consider first that $t_i = 2^{-v_i}$. Then, the function $\prod_{i\in I} t_i + \prod_{i\in A\setminus I} t_i = 2^{-\sum_{i\in I} v_i} + 2^{-\sum_{i\in A\setminus I} v_i}$ is convex and achieves its minimum when $2^{-\sum_{i\in I} v_i} = 2^{-\sum_{i\in A\setminus I} v_i} \iff \sum_{i\in I} v_i = \sum_{i\in A\setminus I} v_i = 1$. Thus, using $Y$ as the MAP variables and $d_m^*$ as the evidence, we obtain $\max_y p(d_m^*, y) = \frac{2}{3}\frac{1}{2^m}$ if and only if there is an even partition. This reduction is still flawed in one respect: the specification of the probability functions depends on computing the values $t_i$, for each $i \in A$, which can be done only to a certain precision (we can only use a number of places that is polynomial in $b$).
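
The idealised version of this argument can be illustrated on tiny instances. The sketch below is only an illustration under stated assumptions: it does not build the network of Figure 4, it evaluates the closed form $p(d_m^*, y) = \frac{1}{2^m}\left(1 - \frac{1}{3}\left(\prod_{i\in I} t_i + \prod_{i\in A\setminus I} t_i\right)\right)$ with $t_i = 2^{-v_i}$ in floating point (ignoring the rounding issue just mentioned), and the names `max_map_value` and `has_even_partition` are hypothetical.

```python
import math
from itertools import product


def max_map_value(sizes):
    """Brute-force max over y of p(d*_m, y), using the closed form with t_i = 2**(-s_i / S)."""
    m = len(sizes)
    S = sum(sizes) / 2  # half of the total, as in the rescaling v_i = s_i / S
    best = 0.0
    for y in product((True, False), repeat=m):
        sum_in = sum(s for s, y_i in zip(sizes, y) if y_i)  # elements with Y_i at y_i^T
        sum_out = 2 * S - sum_in
        value = 2.0 ** -m * (1 - (2.0 ** (-sum_in / S) + 2.0 ** (-sum_out / S)) / 3)
        best = max(best, value)
    return best


def has_even_partition(sizes):
    """An even partition exists iff the MAP value reaches (2/3) * 2**-m."""
    target = (2 / 3) * 2.0 ** -len(sizes)
    return math.isclose(max_map_value(sizes), target, rel_tol=1e-12)
```

For example, `has_even_partition([3, 1, 1, 2, 2, 1])` holds because 3 + 2 = 1 + 1 + 2 + 1, while `[1, 1, 4]` admits no even partition and the maximum stays strictly below the target.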

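Let $t_i$ be $2^{-v_i}$ computed with $6b + 3$ bits of precision and rounded up (if necessary). This implies $2^{-v_i} \le t_i < 2^{-v_i + 2^{-6b}}$ (by a corollary given in Appendix A). Suppose first that $I \subset A$ is an even partition. Then,

$$\prod_{i\in I} t_i + \prod_{i\in A\setminus I} t_i < 2^{\sum_{i\in I}(-v_i + 2^{-6b})} + 2^{\sum_{i\in A\setminus I}(-v_i + 2^{-6b})} \le 2^{-\sum_{i\in I} v_i + m2^{-6b}} + 2^{-\sum_{i\in A\setminus I} v_i + m2^{-6b}}$$

$$\le 2^{-\sum_{i\in I} v_i + 2^{-5b}} + 2^{-\sum_{i\in A\setminus I} v_i + 2^{-5b}} = 2^{-1+2^{-5b}} + 2^{-1+2^{-5b}} = 2^{2^{-5b}},$$

and

$$\frac{1}{2^m}\left(1 - \frac{1}{3}\left(\prod_{i\in I} t_i + \prod_{i\in A\setminus I} t_i\right)\right) > \frac{1}{2^m}\left(1 - \frac{1}{3}\,2^{2^{-5b}}\right).$$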



Take now the case where $I \subset A$ is not an even partition. Without loss of generality, suppose that $\sum_{i\in I} s_i = S - l$, with $0 < l < S$ an integer. This implies $\sum_{i\in I} v_i = 1 - l/S$ and $\sum_{i\in A\setminus I} v_i = 1 + l/S$. In this case, we have (because the $t_i$ were rounded up, if needed):

$$\prod_{i\in I} t_i + \prod_{i\in A\setminus I} t_i \ge 2^{-\sum_{i\in I} v_i} + 2^{-\sum_{i\in A\setminus I} v_i} = 2^{-1+l/S} + 2^{-1-l/S}.$$



We show that $2^{-1+l/S} + 2^{-1-l/S} > 2^{2^{-4b}}$. Note that $2^{-1+l/S} + 2^{-1-l/S}$ is a convex function of $l$ and achieves its minimum when $l$ is minimized. Thus, it is enough to show that $2^{-1+1/S} + 2^{-1-1/S} > 2^{2^{-4b}}$ (because $l$ is a positive integer). Let $x = 1/S$. Note that $2^{-b} \le x \le 1/2$, because $2 \le S \le 2^b$. With some manipulations we have

$$2^{-1+x} + 2^{-1-x} > 2^{2^{-4b}} \iff 1 + 2^{2x} > 2^{2^{-4b}+1+x} \iff \log_2(1 + 2^{2x}) > 2^{-4b} + 1 + x,$$

and thus it is enough to have $\log_2(1 + 2^{2x}) - x^4 - 1 - x > 0$ (because $x \ge 2^{-b}$), which follows by Lemma 17 (as $x \le 1/2$).

At this point we know that an even partition leads to a value of $\prod_{i\in I} t_i + \prod_{i\in A\setminus I} t_i$ that is less than $2^{2^{-5b}}$, while a non even partition leads to a value that is greater than $2^{2^{-4b}}$. Now, we pick a threshold $r = \frac{1}{2^m}\left(1 - \frac{a}{3}\right)$, where $a$ equals $2^{2^{-5b}}$ computed with $5b + 3$ bits of precision and rounded up (if necessary), to decide if the partition is even. In this way, $2^{2^{-5b}} \le a < 2^{2^{-5b}+2^{-5b}} = 2^{2^{-5b+1}} < 2^{2^{-4b}}$, and thus $r$ separates the cases with even and non even partitions:

$$p(d_m^*, y'') < \frac{1}{2^m}\left(1 - \frac{2^{2^{-4b}}}{3}\right) < r = \frac{1}{2^m}\left(1 - \frac{a}{3}\right) \le \frac{1}{2^m}\left(1 - \frac{2^{2^{-5b}}}{3}\right) < p(d_m^*, y'),$$

where $y'$ corresponds to an even partition and $y''$ to any non even partition. □

Corollary 11 Decision-MAP is NP-complete when the graph is restricted to a tree and variables have bounded cardinality.

Proof It follows directly from Theorems 5 and 10. □

4. A new algorithm for MAP 

Despite the "negative" complexity results of the previous section, this section describes a considerably fast exact algorithm for MAP, which is later extended to run in an approximate way. Variables that are part of the MAP query will be denoted $X^{map} \subseteq X \setminus E$ throughout the section. The basis to solve the problem is to compute $p(x^{map}, e)$ using the earlier BU equation for every possible $x^{map} \in \Omega_{X^{map}}$, but that would need $z(X^{map}) = \prod_{X_i \in X^{map}} z_i$ evaluations of that equation, which is the same as a brute-force approach. Another approach is to propagate the information just as in the BU query, but considering many probability functions for different instantiations of $x^{map}$ in every possible way, yet somehow locally. We will use the notation $p_{x'^{map}}(u \mid v) = p(x'^{map}, u \mid v)$, where $x'^{map} \in \Omega_{X'^{map}}$ (with $X'^{map} \subseteq X^{map}$, being clear later from the context), $u \in \Omega_U$ (with $U \subseteq X \setminus X^{map}$), $v \in \Omega_V$ (with $V \subseteq X \setminus (U \cup X^{map})$), and hence we have distinct functions $p_{x'^{map}} : \Omega_U \times \Omega_V \to \mathbb{Q}$ for distinct instantiations $x'^{map}$. We assume further that $u, v$ always respect the instantiation of the evidence $e$. For convenience, $p_{x'^{map}}(U \mid V)$, or simply $p_{x'^{map}}$, may be specified by the vector $[p_{x'^{map}}(u \mid v)]_{\forall (u \in \Omega_U, v \in \Omega_V):\ (u,v) \text{ respect } e}$ in $\mathbb{Q}^d$, where $d = z((U \cup V) \setminus E)$, as the part of $u, v$ corresponding to $e$ is fixed to the observed states.

Let $(C, T)$ be a tree decomposition of the network. The main idea of the algorithm is quite simple: we propagate through the tree only the functions $p_{x^{map}}$ that are not dominated by others, that is, at each step we keep only the pareto set of all $p_{x^{map}}$.

Definition 12 Let $S$ be a subset of $\mathbb{Q}^d$ (for a fixed dimension $d$), and $a \in S$ and $b \in S$ be two $d$-dimensional vectors of rationals such that $a_i, b_i$ are the $i$-th values of $a, b$, respectively. We say that $a$ dominates $b$, and indicate by $a \succ b$, if $a_i \ge b_i$ for every $1 \le i \le d$ and $a_i > b_i$ for at least one $i$. A vector $a \in S$ is called non-dominated in $S$ if there is no vector $b \in S$ such that $b \succ a$. A set of non-dominated vectors is a pareto set.
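
Definition 12 translates directly into code. The following is a minimal sketch (a quadratic-time filter, not the implementation used in the experiments); the names `dominates` and `pareto_set` are hypothetical, and vectors are represented as plain tuples of numbers.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a >= b in every coordinate and a > b in at least one (a dominates b)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(vectors: Sequence[Vector]) -> List[Vector]:
    """Return the non-dominated vectors, keeping input order and dropping exact duplicates."""
    unique = list(dict.fromkeys(vectors))
    return [a for a in unique if not any(dominates(b, a) for b in unique)]
```

For instance, among `(0.2, 0.1)`, `(0.1, 0.3)` and `(0.1, 0.1)`, the last vector is dominated by the first, while the first two are incomparable and both remain in the pareto set.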

Note that for each instantiation $x^{map}$, the function $p_{x^{map}}(U \mid V)$ is simply a multi-dimensional vector containing probability values for each instantiation of its arguments $U$ and $V$ (except for those in $E$). In view of Def. 12, two vectors $a = p_{x_1^{map}}$ and $b = p_{x_2^{map}}$ in the same dimension $d = z((U \cup V) \setminus E)$ can be compared to verify if one dominates the other. The algorithm proceeds as follows: Eq. (8) is recursively evaluated, starting from the leaves of the tree. At each step $C_j$, it uses every combination of vectors $p_{x^{map}_{C_{j'}}}$ in the pareto sets previously obtained at the children nodes $C_{j'}$ to compute a new pareto set of vectors $p_{x^{map}_{C_j}}$ as the output of the node $C_j$, which will be propagated to the parent of $C_j$. The notation $x^{map}_{C_j}$ stands for the queried MAP variables that have already been processed, that is, they appear only in $C_j$ and/or its descendants and do not appear in the parent of $C_j$ in $T$.

$$p_{x^{map}_{C_j}}(u_{C_j} \mid v_{C_j}) = \sum_{X^{last}_{C_j}\setminus(E\cup X^{map})}\ \prod_{X_i\in X^{proc}_{C_j}} p(x_i \mid \pi_{X_i}) \cdot \prod_{C_{j'}\in\Lambda_{C_j}} p_{x^{map}_{C_{j'}}}(u_{C_{j'}} \mid v_{C_{j'}}). \qquad (8)$$

Note that all the elements $x_i, \pi_{X_i}$ (for each $X_i$), and $u_{C_{j'}}, v_{C_{j'}}$ (for each $j'$) must agree with the evidence $e$ (this is done by instantiating them with the value of the evidence from the beginning).

Now, the advantage here is to take into account that only vectors $p_{x^{map}_{C_j}}$ belonging to the pareto set are able to produce the value that maximizes the joint probability of the queried variables. This is proved by the following arguments: take a node $C_j$ and suppose that there is a child $C_{j'} \in \Lambda_{C_j}$ such that $p_{x^{map}_{1C_{j'}}} \succ p_{x^{map}_{2C_{j'}}}$ (recall that both $p_{x^{map}_{1C_{j'}}}$ and $p_{x^{map}_{2C_{j'}}}$ are in fact numeric vectors in a $z((U_{C_{j'}} \cup V_{C_{j'}}) \setminus E)$ dimensional space, so they are comparable). Let $p_{x^{map}_{1C_j}}$ and $p_{x^{map}_{2C_j}}$ be the vectors obtained by the computations at node $C_j$ using respectively $p_{x^{map}_{1C_{j'}}}$ and $p_{x^{map}_{2C_{j'}}}$ (while all other numbers are somehow fixed). If we replace the vector $p_{x^{map}_{2C_{j'}}}$ by the vector $p_{x^{map}_{1C_{j'}}}$, we have that every product and/or summation of Eq. (8) will not decrease, and at least one of them will increase, because Eq. (8) is a sum-product of non-negative numbers. This concludes that $p_{x^{map}_{1C_j}} \succ p_{x^{map}_{2C_j}}$. As the final objective $p(x^{map}_{opt}, e) = p_{x^{map}_{C_1}}(u_{C_1})$ is certainly a non-dominated vector in one dimension (otherwise it would not be a maximum solution), an inductive argument over the tree decomposition suffices to show that dominated vectors may be discarded.
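
The monotonicity argument can be illustrated with a small experiment. This is only a sketch under simplifying assumptions: a single step of Eq. (8) is replaced by a stand-in, namely a non-negative linear combination of the entries of one child vector, and all names (`parent_value`, `check_once`) are hypothetical. It checks that the maximum obtained from all candidate vectors coincides with the maximum obtained from the pareto set alone.

```python
import random


def dominates(a, b):
    """a >= b in every coordinate and a > b in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(vectors):
    """Non-dominated vectors of the input, duplicates removed."""
    unique = list(dict.fromkeys(vectors))
    return [a for a in unique if not any(dominates(b, a) for b in unique)]


def parent_value(weights, child_vector):
    """Stand-in for one sum-product step: non-negative combination of the child's entries."""
    return sum(w * v for w, v in zip(weights, child_vector))


def check_once(rng, n_vectors=30, dim=3):
    """Compare the best value over all vectors with the best value over the pareto set."""
    vectors = [tuple(rng.random() for _ in range(dim)) for _ in range(n_vectors)]
    weights = [rng.random() for _ in range(dim)]
    best_all = max(parent_value(weights, v) for v in vectors)
    best_pareto = max(parent_value(weights, v) for v in pareto_set(vectors))
    return best_all == best_pareto
```

Because the combination is monotone in every coordinate, a dominated vector can never yield a strictly larger value than a vector dominating it, so `check_once` is expected to return `True` for any seed.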

In the worst case, the pareto set will be composed of all the vectors, and the procedure will simply run a sophisticated brute-force approach: all the candidates would be propagated through the nodes of the tree decomposition until the corresponding maximizations are performed. However, the expected number of elements in a pareto set created from random vectors is polynomial [18]. Such an attractive situation can be seen in our experiments (Section 5). There is another interesting property of this idea: if MAP variables cut the graph (and the corresponding tree decomposition is built to exploit this situation such that all variables propagated from a tree node to its parent are MAP variables or evidence), then the complexity of solving MAP reduces to the complexity of the subparts of the graph. This happens because the information to be propagated between the separate parts, that is, $p_{x^{map}_{C_j}}(u_{C_j} \mid v_{C_j})$, reduces to a single number, because if MAP variables cut the graph (and the tree decomposition is built accordingly) such that $(U_{C_j} \cup V_{C_j}) \setminus E = \emptyset$, then $z((U_{C_j} \cup V_{C_j}) \setminus E) = z(\emptyset) = 1$ (the pareto set will contain only a vector of size one with the maximum value related to the best configuration for the corresponding $x^{map}$), and this is pretty much equivalent to solving the problem separately in the subparts (in a proper order). In other words, the worst-case
exponential time of MAP is limited to the number of MAP variables that are "visible" to each other (by visible we 
mean that there is a path in the underlying graph between the MAP variables that does not contain any MAP variable
or evidence). Although such situation might not be very common in general graphs, it might happen often in trees 
and polytrees, as any variable that is not a leaf or a root node always cuts the graph (in the case of networks other 
than trees, a simple transformation with an extra node per child of a MAP variable might be used to avoid connections 
introduced by the moralization of the graph, and thus keeping the cut induced by the MAP variables in the moralized 
version). For instance, if the MAP variables are randomly positioned in a tree, then the algorithm will probably run 
very fast, because the number of visible variables (to each other variable) is likely to be very small. Besides that, if an 
approximation is enough, then an adapted version of this algorithm runs in worst-case polynomial time, as we see in 
the sequel. 

Theorem 13 MAP-z-w has an FPTAS for any fixed z and w.

Proof We need to show that, for a given $\varepsilon > 0$, there is an algorithm $A$ that is polynomial in $\mathrm{size}(\mathcal{N})$ and in $\frac{1}{\varepsilon}$ such that the value $p^A(x^{map}, e)$ obtained by $A$ is at least $\frac{p(x^{map}_{opt}, e)}{1+\varepsilon}$, where $p(x^{map}_{opt}, e)$ stands for the optimal solution value of MAP-z-w, that is, $x^{map}_{opt} = \mathrm{argmax}_{x^{map}}\, p(x^{map}, e)$. We know that $p(x^{map}_{opt}, e) > 0$ because $\sum_{x^{map}} p(x^{map}, e) = p(e) > 0$ and thus the one achieving the maximum cannot be zero. In fact we have that $1 \ge p(e) \ge p(x^{map}_{opt}, e) \ge 2^{-g(\mathrm{size}(\mathcal{N}))}$, where $g : \mathbb{Q} \to \mathbb{Q}$ is a polynomial function, because $p(x^{map}_{opt}, e)$ is obtained from a sequence of $O(nz^w)$ additions and multiplications over numbers of the input. The same argument holds for every intermediate probability value: we have that $p(u_{C_j} \mid v_{C_j}) > 0 \implies p(u_{C_j} \mid v_{C_j}) \ge 2^{-g(\mathrm{size}(\mathcal{N}))}$, for a given polynomial function $g$. Hence, let $g$ be a polynomial function that satisfies such a condition for every number involved in the calculations. Let $(C, T)$ be the tree decomposition of the network of width $w$, which can be obtained in polynomial time [12], and $w' = w + 1$ be the maximum number of variables in a single node of the decomposition. Recall that each $p_{x^{map}_{C_j}}(U_{C_j} \mid V_{C_j})$ is a vector in the dimension $z((U_{C_j} \cup V_{C_j}) \setminus E)$, so the idea is to show that we can fix an upper bound on the number of vectors of the pareto set containing $p_{x^{map}_{C_j}}(U_{C_j} \mid V_{C_j})$. For that purpose and following the ideas of [18], we divide the space into a lattice of hypercubes such that, in each coordinate, the ratio of the largest to the smallest value is $1 + \frac{\varepsilon}{2w'n'}$ (recall that $n' \in O(n)$ is the number of nodes in the tree decomposition), which produces a number of hypercubes bounded by

//^pi >t M (9) 



because every $p_{x^{map}_{C_j}}(u_{C_j} \mid v_{C_j}) \le 1$, and the log appears because the lattice is created from 1 to $2^{-g(\mathrm{size}(\mathcal{N}))}$, successively dividing the coordinate by $1 + \frac{\varepsilon}{2w'n'}$ (a bin for the exact zero probability is also allocated). In each hypercube, we keep at most one vector, so Eq. (9) bounds the number of elements in the pareto set that is computed at each node $C_j$. We call this set a reduced pareto set. This procedure is carried out over all the nodes of the tree. Therefore, we have a polynomial time procedure both in $\mathrm{size}(\mathcal{N})$ and in $\frac{1}{\varepsilon}$, because the total running time is less than Eq. (3) times Eq. (9) raised to 2 (using a binary tree decomposition). It remains to show that the resulting $p^A(x^{map}, e)$ is at least $\frac{p(x^{map}_{opt}, e)}{1+\varepsilon}$. Each value $p_{x^{map}_{C_j}}(u_{C_j} \mid v_{C_j})$ is obtained from a sum of multiplications, with at most $|\Lambda_{C_j}| + 1$ terms each. Hence, the approximation satisfies

$$p^A_{x^{map}_{C_j}}(u_{C_j} \mid v_{C_j}) \ge \sum_{X^{last}_{C_j}\setminus(E\cup X^{map})}\ \prod_{X_i\in X^{proc}_{C_j}} p(x_i \mid \pi_{X_i}) \cdot \prod_{C_{j'}\in\Lambda_{C_j}} \frac{p_{x^{map}_{C_{j'}}}(u_{C_{j'}} \mid v_{C_{j'}})}{\left(1+\frac{\varepsilon}{2w'n'}\right)^{l_{C_{j'}}}} \ge \frac{p_{x^{map}_{C_j}}(u_{C_j} \mid v_{C_j})}{\left(1+\frac{\varepsilon}{2w'n'}\right)^{l_{C_j}}},$$

as $l_{C_j}$, the weight of $C_j$ (which is the number of variables that appear in nodes of the subtree of $T$ rooted at $C_j$), is equal to $l_{C_j} = |C_j| + \sum_{C_{j'}\in\Lambda_{C_j}} l_{C_{j'}}$. Now taking the root of $T$, we have that $p^A(x^{map}, e) \ge \frac{p(x^{map}_{opt}, e)}{\left(1+\frac{\varepsilon}{2w'n'}\right)^{w'n'}}$, as $w'n' \ge l_{C_1}$ (there are at most $w'$ elements per node $C_j$). It is important to mention that some intermediate values $p^A_{x^{map}_{C_j}}(u_{C_j} \mid v_{C_j})$ can be zero, but one can prove by induction in $T$ that this is not a problem for the approximation, because it only happens if the corresponding exact $p_{x^{map}_{opt,C_j}}(u_{C_j} \mid v_{C_j})$ is also zero. The proof is as follows: take a leaf as basis. In this case, $p^A_{x^{map}_{C_j}}(u_{C_j} \mid v_{C_j})$ is zero only if parameters $p(x_i \mid \pi_{X_i})$ of the input turn it into a zero, and the value is precise (it means that it is equal to $p_{x^{map}_{opt,C_j}}(u_{C_j} \mid v_{C_j})$). Now take an internal node where $p^A_{x^{map}_{C_j}}(u_{C_j} \mid v_{C_j})$ is zero. We have that there is a factor in each term of the summation that is zero (because it is a sum of non-negative numbers). If the zero is at a parameter $p(x_i \mid \pi_{X_i})$ of the input, then the result of the multiplication is precise (the other factors of the multiplication do not matter). Otherwise, if the zero is at a given $p^A_{x^{map}_{C_{j'}}}(u_{C_{j'}} \mid v_{C_{j'}})$, then by hypothesis of the induction that value is precise and not a result of an approximation, so the same argument holds. This concludes that wherever a zero appears in the computations, it does not interfere in the approximation estimation (in fact, it can be only beneficial), so the computation of each $p^A_{x^{map}_{C_j}}(u_{C_j} \mid v_{C_j})$ is still within the $\left(1+\frac{\varepsilon}{2w'n'}\right)^{-l_{C_j}}$ ratio.

Finally,

$$p^A(x^{map}, e) = p^A_{x^{map}_{C_1}}(u_{C_1}) \ge \frac{p_{x^{map}_{opt,C_1}}(u_{C_1})}{\left(1+\frac{\varepsilon}{2w'n'}\right)^{l_{C_1}}} \ge \frac{p(x^{map}_{opt}, e)}{\left(1+\frac{\varepsilon}{2w'n'}\right)^{w'n'}} \ge \frac{p(x^{map}_{opt}, e)}{1+\varepsilon},$$

because of the inequality $\left(1+\frac{\varepsilon}{r}\right)^r \le 1 + 2\varepsilon$, which is valid for any $0 < \varepsilon < 1$ and $r$ a positive integer (the left-hand side is convex in $\varepsilon$). □
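
The lattice step of the proof can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the names `lattice_cell` and `reduce_by_lattice` are hypothetical, the parameters `w_prime` and `n_prime` stand for $w'$ and $n'$, entries are assumed to lie in $[0, 1]$, and the first vector found in each hypercube is the one kept.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Tuple[float, ...]
Cell = Tuple[Optional[int], ...]


def lattice_cell(vector: Sequence[float], ratio: float) -> Cell:
    """Hypercube index of a vector in a geometric lattice with boundaries 1, 1/ratio, 1/ratio**2, ...

    Exact zeros get a dedicated bin (None), as in the proof.
    """
    cell = []
    for value in vector:
        if value == 0.0:
            cell.append(None)
        else:
            cell.append(math.floor(-math.log(value) / math.log(ratio)))
    return tuple(cell)


def reduce_by_lattice(vectors: Sequence[Vector], epsilon: float,
                      w_prime: int, n_prime: int) -> List[Vector]:
    """Keep at most one vector per hypercube, using the ratio 1 + epsilon / (2 * w' * n')."""
    ratio = 1.0 + epsilon / (2.0 * w_prime * n_prime)
    kept: Dict[Cell, Vector] = {}
    for v in vectors:
        kept.setdefault(lattice_cell(v, ratio), v)  # the first vector in each cell is kept
    return list(kept.values())
```

Any discarded vector shares its hypercube with a kept one, so each of its coordinates exceeds the kept coordinate by a factor of less than $1 + \frac{\varepsilon}{2w'n'}$; this is the per-node loss that the proof accumulates along the tree.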

It follows from Theorem 13 that MAP-z-w has an FPTAS in any network with bounded width and cardinality, including polytrees. At first, this result seems to contradict past results, where it is stated that approximating MAP is hard even in polytrees [1]. But note that we assume a bound for the cardinality of the variables, which is the most common situation in practical BNs, while previous results work with a more general class of networks and do not assume the bound. In many applications the number of states of a variable does not increase with the number of nodes (in fact, it is usually much smaller). Finally, we must point out that the result of Theorem 13 uses a multiplicative approximation error, where the target is to be near the optimal solution by considering a worst-case error based on a small multiplicative factor, that is, if $M$ is the true maximum value of the optimization, we can guarantee to find a solution with value at least $M/r^\circ$, where $r^\circ > 1$ is the approximation ratio. Still, it is possible to derive an additive approximation algorithm, where the goal is to be within an additive term with respect to the optimal value, that is, we would ensure to find a solution with value at least $M - r^1$, with $r^1 > 0$ being the additive approximation error.

Table 3: Average results of runs of the algorithms in many randomly generated networks where Samlam has solved the problem (many lines are missing because no instance was solved by Samlam in those cases). #Q means number of queries. SS means search space size. Within parentheses and next to the time results are the counts of successful runs of each algorithm, for each network type.

Net type                   #Q    SS       Samlam      Approx.     Exact       Avg.     Avg.
                                          time(sec)   time(sec)   time(sec)   pareto   dimen.
alarm.37.nb.(0-10)         103   2^?      35.4        0.0 (103)   0.0 (103)   1.2      1.4
insurance.27.nb.(0-10)     80    2^4.4    104.6       0.0 (80)    0.0 (80)    2.5      9.9
insurance.27.nb.(10-20)    18    2^13.9   1492.3      3.9 (18)    8.9 (18)    337.8    132.2
poly.100.(0-10)            62    2^?      22.2        0.0 (62)    0.0 (62)    1.0      1.2
poly.100.nb.(0-10)         48    2^?      1.0         0.0 (48)    0.0 (48)    1.2      1.2
poly.50.(0-10)             55    2^?      14.5        0.0 (55)    0.0 (55)    1.2      1.3
rand.100.(0-10)            20    2^?      0.0         0.0 (20)    0.0 (20)    1.0      1.1
rand.100.tw4.(0-10)        22    2^?      1.1         0.0 (22)    0.0 (22)    1.4      1.1
rand.100.tw8.(0-10)        23    2^?      90.7        0.0 (23)    0.0 (23)    1.2      1.2
rand.30.(0-10)             45    2^4.5    196.4       0.0 (45)    0.0 (45)    2.3      2.4
rand.30.(10-20)            13    2^12.5   649.1       0.1 (13)    0.1 (13)    48.2     16.7
rand.30.nb.(0-10)          21    2^?      9.9         0.0 (21)    0.0 (21)    1.1      1.2
rand.30.nb.(10-20)         1     2^12.0   1338.0      0.0 (1)     0.0 (1)     49.0     6.0
rand.50.(0-10)             21    2^2.2    37.2        0.0 (21)    0.0 (21)    1.1      1.2
rand.50.nb.(0-10)          61    2^?      63.7        0.0 (61)    0.0 (61)    1.3      1.3
rand.50.tw4.(0-10)         27    2^?      82.0        0.0 (27)    0.0 (27)    1.6      1.6
rand.50.tw8.(0-10)         24    2^3.0    3.3         0.0 (24)    0.0 (24)    2.0      1.5
rand30iw4.(0-10)           57    2^5.0    241.6       0.0 (57)    0.0 (57)    7.3      2.5
rand30iw4.(10-20)          26    2^14.3   1245.1      0.2 (26)    0.9 (26)    101.3    6.4


Additive approximation algorithms are better than their multiplicative counterparts when the optimal value is large, and worse when it is small. The main idea of the proof is similar to that of Theorem 13, but the lattice is built by dividing the space with hypercubes of uniform length (again based on $\varepsilon$, $w$ and $n$). We have chosen to present the multiplicative version of the proof because it is the most common in the theory of approximation algorithms.
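
The trade-off between the two guarantees can be made concrete with a few lines of code. This is only an illustrative sketch: the function names are hypothetical and the numeric values are arbitrary, chosen to show one case on each side of the crossover.

```python
def multiplicative_floor(optimum: float, ratio: float) -> float:
    """Worst value allowed by a multiplicative guarantee with ratio > 1."""
    return optimum / ratio


def additive_floor(optimum: float, error: float) -> float:
    """Worst value allowed by an additive guarantee with error > 0."""
    return optimum - error


def additive_is_tighter(optimum: float, ratio: float, error: float) -> bool:
    """True when the additive guarantee leaves less room below the optimum."""
    return additive_floor(optimum, error) > multiplicative_floor(optimum, ratio)
```

With ratio 1.5 and additive error 0.1, the additive guarantee is the tighter one for an optimum of 0.9 (floor 0.8 against 0.6), but not for an optimum of 0.05, where the additive floor is negative and hence vacuous.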

5. Experiments and final remarks 

We perform experiments with the exact method using the structure of some well known networks and some randomly generated networks. Tables 3, 4 and 5 show the type of the network (names presented in the first column follow the notation type.size.subtype.limit, where type is in {alarm, insurance, poly, rand}, respectively meaning alarm, insurance, polytree, and random topologies; size is half of the number of nodes; subtype is in {nb, tw4, tw8}, meaning respectively non-binary, treewidth equal to 4, treewidth equal to 8; and limit indicates that those tests correspond to problems where the log (in base 2) of the search space is within this range), the number of queries, the size of the search space (which is the product of the number of states of all the queried MAP variables), the running time of the Samlam package [19], the running time of this procedure using an additive error of (at most) 1%, the running time of this procedure using the exact method, the average number of elements in the pareto set of each step, and the average dimensionality of the vectors in each step. The number of successful runs of each algorithm is presented next to the time results. All networks have nodes with at most 5 states. Apart from the first two columns, the other numbers are averages over the queries. Table 3 only displays results where the Samlam package was able to solve the corresponding problem within one hour of computation, while Table 4 only presents results where the new exact method solved the corresponding problems. Finally, Table 5 presents results for test instances where the new exact method has failed (and consequently Samlam has also failed, because it has solved only a subset of the instances that were solved by the new exact method).

The number of queries in each line is not a constant because we have not generated the networks with a dimensionality constraint (that is, defining a priori the search space size), but instead the space size was verified after the experiments. Around 1700 tests are conducted. We divided the runs into levels of "hardness" to show the differences

Table 4: Average results of runs of the algorithms in many randomly generated networks where the new exact method succeeded in solving the problem. #Q means number of queries. SS means search space size. Within parentheses and next to the time results are the counts of successful runs of each algorithm, for each network type.

Net type                   #Q    SS       Samlam        Approx.       Exact       Avg.     Avg.
                                          time(sec)     time(sec)     time(sec)   pareto   dimen.
alarm.37.nb.(0-10)         120   2^?      35.4 (103)    0.0 (120)     0.0         1.5      2.1
alarm.37.nb.(10-20)        40    2^16.0   -             0.0 (40)      0.0         15.7     8.8
alarm.37.nb.(20-40)        30    2^29.3   -             0.5 (30)      2.2         182.7    12.0
alarm.37.nb.(> 40)         10    2^48.0   -             38.0 (10)     113.7       1415.1   14.0
insurance.27.nb.(0-10)     110   2^?      104.6 (80)    0.0 (110)     0.0         3.1      8.8
insurance.27.nb.(10-20)    60    2^14.0   1492.3 (18)   5.0 (60)      7.2         216.0    163.2
insurance.27.nb.(20-40)    10    2^26.0   -             272.9 (10)    741.9       4379.7   82.0
poly.100.(0-10)            95    2^?      22.2 (62)     0.0 (95)      0.0         1.3      1.5
poly.100.(10-20)           5     2^13.8   -             0.0 (5)       0.0         95.0     3.0
poly.100.nb.(0-10)         83    2^?      1.0 (48)      0.0 (83)      0.0         2.0      1.8
poly.100.nb.(10-20)        13    2^15.2   -             0.0 (13)      0.0         6.5      4.2
poly.100.nb.(20-40)        3     2^28.7   -             1.0 (3)       1.0         204.7    13.0
poly.50.(0-10)             81    2^4.3    14.5 (55)     0.0 (81)      0.0         1.9      1.8
poly.50.(10-20)            16    2^16.0   -             0.0 (16)      0.0         11.9     4.5
poly.50.(20-40)            3     2^25.3   -             0.0 (3)       0.0         30.0     5.7
rand.100.(0-10)            31    2^3.3    0.0 (20)      0.0 (31)      0.0         1.4      1.5
rand.100.(10-20)           10    2^14.3   -             0.0 (10)      0.0         8.5      4.2
rand.100.(20-40)           6     2^24.0   -             1.2 (6)       3.7         362.2    22.2
rand.100.tw4.(0-10)        49    2^5.0    1.1 (22)      0.0 (49)      0.0         6.5      2.3
rand.100.tw4.(10-20)       26    2^14.9   -             0.5 (26)      13.8        477.5    6.0
rand.100.tw4.(20-40)       13    2^26.1   -             8.8 (13)      152.7       920.4    6.8
rand.100.tw4.(> 40)        2     2^48.5   -             16.0 (2)      44.0        557.5    10.0
rand.100.tw8.(0-10)        36    2^3.9    90.7 (23)     0.0 (36)      0.0         6.5      2.1
rand.100.tw8.(10-20)       26    2^?      -             6.8 (26)      20.9        507.1    16.3
rand.100.tw8.(20-40)       19    2^26.9   -             60.4 (19)     451.6       1499.6   29.4
rand.100.tw8.(> 40)        2     2^45.5   -             471.0 (2)     2094.5      2635.5   39.5
rand.30.(0-10)             45    2^4.5    196.4 (45)    0.0 (45)      0.0         2.3      2.4
rand.30.(10-20)            31    2^15.2   649.1 (13)    9.8 (31)      105.1       862.0    40.2
rand.30.(20-40)            4     2^21.8   -             908.0 (4)     1569.0      6431.5   207.2
rand.30.nb.(0-10)          29    2^3.9    9.9 (21)      0.0 (29)      0.0         2.7      1.8
rand.30.nb.(10-20)         16    2^17.1   1338.0 (1)    1.7 (16)      2.3         336.8    15.5
rand.30.nb.(20-40)         11    2^25.5   -             189.0 (11)    638.7       3708.5   48.9
rand.50.(0-10)             35    2^4.2    37.2 (21)     0.0 (35)      0.0         1.9      2.2
rand.50.(10-20)            20    2^15.9   -             4.2 (20)      9.8         490.3    21.6
rand.50.(20-40)            8     2^?      -             71.2 (8)      942.8       5389.1   103.6
rand.50.nb.(0-10)          83    2^4.1    63.7 (61)     0.0 (83)      0.0         1.9      1.7
rand.50.nb.(10-20)         12    2^15.5   -             0.0 (12)      0.2         84.8     4.6
rand.50.nb.(20-40)         4     2^26.5   -             0.0 (4)       0.0         39.0     8.0
rand.50.tw4.(0-10)         47    2^4.4    82.0 (27)     0.0 (47)      0.0         3.3      2.2
rand.50.tw4.(10-20)        27    2^15.9   -             2.6 (27)      44.5        642.1    5.5
rand.50.tw4.(20-40)        16    2^27.9   -             39.1 (16)     741.4       2861.4   8.2
rand.50.tw8.(0-10)         38    2^4.9    3.3 (24)      0.0 (38)      0.0         4.0      2.5
rand.50.tw8.(10-20)        28    2^15.7   -             2.7 (28)      52.5        598.1    17.9
rand.50.tw8.(20-40)        20    2^23.5   -             129.9 (20)    274.4       890.0    33.9
rand30iw4.(0-10)           57    2^5.0    241.6 (57)    0.0 (57)      0.0         7.3      2.5
rand30iw4.(10-20)          41    2^?      1245.1 (26)   0.3 (41)      1.2         144.6    6.8
rand30iw4.(20-40)          1     2^22.0   -             1.0 (1)       2.0         118.0    10.0

Table 5: Average results of runs of the algorithms in many randomly generated networks where the new exact method failed to solve the problem (consequently Samlam also failed, as it has solved a subset of the cases that the new exact method could solve). #Q means number of queries. SS means search space size. Within parentheses and next to the time results are the counts of successful runs of the approximation algorithm.



Net type 


#Q 


SS 


Approx. 
time(sec) 


insurance.27.nb.(20-40) 


20 


r,33.U 


- 


poly.l00.nb.(> 40) 


1 


QbO.U 


0.0(1) 


rand. 100.(10-20) 
rand. 100.(20-40) 
rand.l00.(> 40) 


5 

17 
28 


2 18.4 
2 29.9 
o55.6 


1222.7 (3) 


rand.l00.tw4.(20-40) 
rand.l00.tw4.(> 40) 


6 

4 


9^8. b 
o48.8 


1434.2 (4) 
837.5 (2) 


rand.l00.tw8.(20-40) 
rand.l00.tw8.(> 40) 


12 
5 


2 47.0 


1057.0 (4) 
85.0(1) 


rand.30.(10-20) 
rand.30.(20-40) 


1 

19 


2^0.0 

2 24.3 


1137.0(1) 
263.0 (3) 


rand.30.nb.(20-40) 
rand.30.nb.(> 40) 


11 

17 


2 3i.b 
2 48.1 


1009.3 (3) 


rand.50.(20-40) 
rand.50.(> 40) 


30 

7 


2 33.8 
2 42.4 


1288.5 (2) 


rand.50.nb.(20-40) 


1 


232. U 


0.0(1) 


rand.50.tw4.(10-20) 
rand.50.tw4.(20-40) 
rand.50.tw8.(20-40) 


1 
9 
14 


2 2U.U 
2 28.1 
2 28.2 


11.0(1) 
289.5 (4) 
943.4 (8) 


rand30iw4.(20-40) 


1 


222. U 


3537.0(1) 



in running time. MAP variables were connected to the original network variables (using uniform priors) such that they are always in extreme nodes (roots or leaves), which in general generates hard instances (this is the reason why the number of nodes is twice as many). For example, Alarm.37 has in fact 37 × 2 = 74 nodes, and so on. The search space means the number of BU queries needed to solve the problem by a brute-force approach. The results show that we can exactly solve MAP in networks of practical size. We have also run the state-of-the-art algorithm of the Samlam system [19]. The cells of the tables marked with a dash indicate that the method was unable to output the answer after one hour of computation for any test case. Otherwise, the number within parentheses indicates how many problems were solved out of the total. Even though we have used the additive approximation, the approximation algorithm provided results that are also within 1% in the multiplicative sense for 99.8% of the tests (and in the 0.2% of cases where it has not achieved the 1.01 factor, it was still below a multiplicative approximation factor of 1.03 from the optimal value). Furthermore, the approximation results are in fact exact in 67% of the cases (where we have the exact solution to compare against), and have only 0.09% error (on average) in the remaining cases (much better than the worst error guaranteed by the method). Yet the comparison with Samlam should be viewed in a broad perspective, because differences in the implementations might affect the results. For instance, the tree decomposition (an NP-hard problem) has not been optimized. Nevertheless, the new algorithm reduces time costs by orders of magnitude.

The main bottleneck of the algorithm is shared by many methods: the treewidth. While the constants and exponents
that are hidden by the asymptotic notation in the analysis of the exact method are less severe than they first appear,
the complexity result of the multiplicative FPTAS might at first be seen as theoretical, because the number of
hypercubes that are used to divide the vector space (given by Eq. (O)) is huge. Still, it must be noted that the
implementation of the FPTAS idea is not asymptotically slower than that of the exact algorithm, as we always keep
only a subset of the full Pareto set that is used at each level of the computation of the exact algorithm. However, the
division is so fine-grained that the number of discarded vectors (those belonging to the same hypercube) is small in
many cases, and the overhead of discarding them is not always computationally worthwhile. Still, Table 5 presents the
cases that were not solved by the exact methods but were solved by the approximation method within one hour, which
shows that the approximation algorithm can go beyond what the exact method can do in some cases.
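
The pruning idea discussed above can be illustrated with a short Python sketch. It is only an illustration under
stated assumptions: vectors have strictly positive components, the grid is geometric with ratio 1 + eps in every
coordinate, and the first vector seen in each hypercube is kept as its representative. The partition actually used by
the FPTAS is the one defined by the equation referenced in the text; the names pareto_filter and hypercube_prune are
hypothetical.

```python
import math


def pareto_filter(vectors):
    """Keep only the vectors that no other vector dominates coordinate-wise."""
    unique = list(dict.fromkeys(vectors))  # drop duplicates, keep order
    kept = []
    for v in unique:
        dominated = any(
            w != v and all(wi >= vi for wi, vi in zip(w, v)) for w in unique
        )
        if not dominated:
            kept.append(v)
    return kept


def hypercube_prune(vectors, eps):
    """Keep one representative per cell of a geometric grid of ratio 1 + eps.

    Assumption: every coordinate is strictly positive, so its logarithm exists.
    """
    base = math.log1p(eps)
    cells = {}
    for v in vectors:
        # Index of the hypercube that contains v, one integer per coordinate.
        key = tuple(math.floor(math.log(x) / base) for x in v)
        cells.setdefault(key, v)  # the first vector seen in a cell is kept
    return list(cells.values())


# Hypothetical vectors; the first two fall into the same hypercube for eps = 0.05.
candidates = [(0.1, 0.6), (0.1005, 0.601), (0.5, 0.2)]
print(hypercube_prune(pareto_filter(candidates), 0.05))
```

Two vectors that share a hypercube differ by a factor smaller than 1 + eps in every coordinate, which is what bounds
the loss of each pruning step. When the grid is very fine, most hypercubes hold a single vector and the pruning
removes little, which is the overhead discussed above.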

In summary, this paper closes a few theoretical gaps with respect to the MAP problem in Bayesian networks and
presents an algorithm that is efficient compared to currently available methods. It is also shown that a good
approximation with theoretical guarantees is possible, but its hidden constants and exponents still need to be reduced.
This is a point to be addressed in future work, as is devising an efficient pseudo-polynomial time algorithm
(in fact, both ideas are closely related). Another possibility is to study (theoretically and empirically) how to select
vectors from the Pareto sets in order to further reduce their sizes, which may produce very good approximation results
in a short time (but possibly without theoretical guarantees).

Appendix A. Additional results used in the proofs of theorems

We present here the proofs that are not central to the arguments of the theorems of this paper. In fact they describe 
some simple mathematical relations that are exploited to ensure that the complexity reductions are performed in 
polynomial time. 

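Lemma 14 Let $k > 1$ be an integer and $f : \mathbb{N} \to \mathbb{Q}$ and $g : \mathbb{N} \to \mathbb{N}$ two functions such that
$f(k) \geq 1 + \log_2 g(k)$ and $g(k) > 1$. Then $2^{-f(k)} < 2^{1/g(k)} - 1$.

Proof First note that $2^{-i f(k)} \leq \left(2^{1 + \log_2 g(k)}\right)^{-i} = \frac{1}{(2 g(k))^i}$ for any $0 \leq i \leq g(k)$ and that
$\binom{g(k)}{i} \leq g(k)^i$. Now,

$$\left(2^{-f(k)} + 1\right)^{g(k)} = \sum_{i=0}^{g(k)} \binom{g(k)}{i} \cdot 2^{-i f(k)} \leq \sum_{i=0}^{g(k)} g(k)^i \cdot 2^{-i f(k)} \leq \sum_{i=0}^{g(k)} \frac{g(k)^i}{(2 g(k))^i} = \sum_{i=0}^{g(k)} \frac{1}{2^i} < 2.$$

Finally,

$$\left(2^{-f(k)} + 1\right)^{g(k)} < 2 \implies 2^{-f(k)} + 1 < 2^{1/g(k)} \implies 2^{-f(k)} < 2^{1/g(k)} - 1. \qquad \Box$$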
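
As a sanity check (not a proof), the bound of Lemma 14 can be evaluated numerically. The sketch below reads the lemma
as stating $2^{-f(k)} < 2^{1/g(k)} - 1$ whenever $f(k) \geq 1 + \log_2 g(k)$ and $g(k) \geq 1$, and tests it on the instance
$f(k) = k + 1$, $g(k) = 2^k$ that is used later in the appendix. The function name lemma14_holds is hypothetical, and
the check relies on floating-point arithmetic, so it is restricted to moderate values of k.

```python
import math


def lemma14_holds(f_k, g_k):
    """Check 2**(-f_k) < 2**(1/g_k) - 1 for one pair of values f(k), g(k)."""
    # The small tolerance guards the hypothesis test against rounding in log2.
    if g_k < 1 or f_k < 1 + math.log2(g_k) - 1e-12:
        raise ValueError("hypotheses of the lemma are not satisfied")
    lhs = 2.0 ** (-f_k)
    # expm1 keeps precision when 2**(1/g_k) is very close to 1.
    rhs = math.expm1(math.log(2.0) / g_k)
    return lhs < rhs


# Instance f(k) = k + 1, g(k) = 2**k, checked for k = 1, ..., 30.
checks = [lemma14_holds(k + 1, 2 ** k) for k in range(1, 31)]
print(all(checks))
```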

Lemma 15 Let $v > 2^{-b}$ be a rational number and $k > 1$ an integer. Then $v + 2^{-(k+b+1)} < v \cdot 2^{1/2^k}$ and
$v - 2^{-(k+b+2)} > v \cdot 2^{-1/2^k}$.

Proof The first inequality follows from Lemma 14 with $f(k) = k + 1$ and $g(k) = 2^k$:

$$2^{-(k+1)} < 2^{1/2^k} - 1 \implies 2^{-(k+1)} \cdot 2^{-b} < v \cdot \left(2^{1/2^k} - 1\right) \implies v + 2^{-(k+b+1)} < v \cdot 2^{1/2^k}.$$

The other result is analogous using $f(k) = k - \frac{1}{2^k} + 2$ and $g(k) = 2^k$:

$$2^{-\left(k - \frac{1}{2^k} + 2\right)} < 2^{1/2^k} - 1 \implies 2^{-(k+2)} < \frac{2^{1/2^k} - 1}{2^{1/2^k}} \implies 2^{-(k+2)} < 1 - 2^{-1/2^k} \implies$$

$$2^{-(k+2)-b} < v \cdot \left(1 - 2^{-1/2^k}\right) \implies v - 2^{-(k+b+2)} > v \cdot 2^{-1/2^k}. \qquad \Box$$

Corollary 16 $2^{-v} + 2^{-(k+3)} < 2^{-v + 1/2^k}$ and $2^{-v} - 2^{-(k+4)} > 2^{-v - 1/2^k}$ for every $v < 2$ and integer $k > 1$.

Proof It follows from Lemma 15 with $b = 2$. $\Box$
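
The two inequalities of Corollary 16 can also be checked numerically. The sketch below reads the corollary as
$2^{-v} + 2^{-(k+3)} < 2^{-v + 1/2^k}$ and $2^{-v} - 2^{-(k+4)} > 2^{-v - 1/2^k}$ for $v < 2$, and evaluates both sides in
floating point for a few hypothetical values of v and for k up to 20. It is a sanity check under these assumptions,
not a substitute for the proof; the function name corollary16_holds is hypothetical.

```python
def corollary16_holds(v, k):
    """Evaluate both inequalities of Corollary 16 for given v and k."""
    step = 1.0 / 2 ** k
    upper_ok = 2.0 ** (-v) + 2.0 ** (-(k + 3)) < 2.0 ** (-v + step)
    lower_ok = 2.0 ** (-v) - 2.0 ** (-(k + 4)) > 2.0 ** (-v - step)
    return upper_ok and lower_ok


# Hypothetical sample of values v < 2, combined with k = 1, ..., 20.
values_of_v = [-3.0, 0.0, 0.5, 1.0, 1.5, 1.99]
all_hold = all(corollary16_holds(v, k) for v in values_of_v for k in range(1, 21))
print(all_hold)
```

The bound on v matters: calling corollary16_holds(3.0, 1) evaluates the first inequality as 0.1875 < 2^{-2.5}, which
is false, so the check fails once v is sufficiently larger than 2.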

Lemma 17 Given a rational $0 < x < \frac{1}{2}$, we have that $\log_2(1 + 2^{2x}) - x^4 - x - 1 > 0$.

Proof We apply a Taylor expansion around zero for the expression as follows:

$$\log_2(1 + 2^{2x}) - x^4 - x - 1 \in \frac{\log 2}{2} x^2 - \frac{12 + (\log 2)^3}{12} x^4 + O(x^6).$$

And then some manipulations are enough to show the desired result:

$$\frac{\log 2}{2} - \frac{12 + (\log 2)^3}{48} > 0 \implies \frac{\log 2}{2} x^2 - \frac{12 + (\log 2)^3}{48} x^2 > 0 \implies$$

$$\frac{\log 2}{2} x^2 - \frac{1}{4} \cdot \frac{12 + (\log 2)^3}{12} x^2 > 0 \implies \frac{\log 2}{2} x^2 - \frac{12 + (\log 2)^3}{12} x^4 > 0.$$